Wei Yang [Wed, 5 Dec 2018 00:13:39 +0000 (11:13 +1100)]
vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n
fa5e084e43eb ("vmscan: do not unconditionally treat zones that fail
zone_reclaim() as full") changed the return value of node_reclaim(). The
original return value 0 means NODE_RECLAIM_SOME after this commit.
While the return value of node_reclaim() when CONFIG_NUMA is n is not
changed. This will leads to call zone_watermark_ok() again.
This patch fixes the return value by adjusting it to NODE_RECLAIM_NOSCAN.
Since node_reclaim() is only called in page_alloc.c, move it to
mm/internal.h.
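For reference, a minimal sketch of the !CONFIG_NUMA stub this implies in
mm/internal.h (an illustration of the described change, not the exact hunk):

#ifdef CONFIG_NUMA
extern int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
			unsigned int order);
#else
static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
			       unsigned int order)
{
	/* nothing was scanned, so the caller must not re-check the
	 * watermark as if some pages had been reclaimed */
	return NODE_RECLAIM_NOSCAN;
}
#endif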
Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:38 +0000 (11:13 +1100)]
mm: remove managed_page_count_lock spinlock
Now that totalram_pages and managed_pages are atomic variables, there is
no need for the managed_page_count spinlock. The lock provided only a
weak consistency guarantee anyway: it was never used for anything but the
updates, and no reader actually cares about all the values being updated
in sync.
Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:38 +0000 (11:13 +1100)]
mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline functions.
The main motivation was that managed_page_count_lock handling was
complicating things. It was discussed at length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
better to remove the lock and convert the variables to atomic, with
preventing potential store-to-read tearing as a bonus.
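For illustration, the accessor pattern this describes looks roughly like
the following sketch (the underscore-prefixed name is how the raw atomic
is hidden behind the inline function):

extern atomic_long_t _totalram_pages;

static inline unsigned long totalram_pages(void)
{
	/* a single atomic load; readers cannot observe a torn value */
	return (unsigned long)atomic_long_read(&_totalram_pages);
}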
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:38 +0000 (11:13 +1100)]
mm: convert zone->managed_pages to atomic variable
totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.
This patch converts zone->managed_pages. Subsequent patches will convert
totalram_pages and totalhigh_pages, and eventually
managed_page_count_lock will be removed.
The main motivation was that managed_page_count_lock handling was
complicating things. It was discussed at length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
better to remove the lock and convert the variables to atomic, with
preventing potential store-to-read tearing as a bonus.
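A sketch of the per-zone side of the conversion, assuming the accessor
name used by the series:

static inline unsigned long zone_managed_pages(struct zone *zone)
{
	return (unsigned long)atomic_long_read(&zone->managed_pages);
}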
Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:37 +0000 (11:13 +1100)]
mm: reference totalram_pages and managed_pages once per function
Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.
This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.
totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.
The main motivation was that managed_page_count_lock handling was
complicating things. It was discussed at length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seems better
to remove the lock and convert the variables to atomic. With the change,
preventing potential store-to-read tearing comes as a bonus.
This patch (of 4):
This is in preparation for a later patch which converts totalram_pages
and zone->managed_pages to atomic variables. Please note that re-reading
the value might lead to a different value and as such it could lead to
unexpected behavior. There are no known bugs as a result of the current
code but it is better to prevent them in principle.
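The pattern is simply to load the global once into a local and use that
snapshot throughout the function; a hedged sketch (the function and
threshold are hypothetical):

/* hypothetical caller: both comparisons see the same snapshot even if
 * totalram_pages is updated concurrently */
static bool low_on_memory(unsigned long min_pages)
{
	unsigned long pages = totalram_pages;	/* referenced once */

	return pages < min_pages || pages / 2 < min_pages;
}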
Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Kirill Tkhai [Wed, 5 Dec 2018 00:13:37 +0000 (11:13 +1100)]
mm/ksm.c: assist buddy allocator to assemble 1-order pages
try_to_merge_two_pages() merges two pages: one of them is a page of the
currently scanned mm, the second is a page with an identical hash from
the unstable tree. Currently, we merge the page from the unstable tree
into the first one and then free it.
The idea of this patch is to prefer freeing whichever of the two pages
has a free neighbour (i.e., a neighbour with zero page_count()). This
allows the buddy allocator to assemble at least a 1-order set from the
freed page and its neighbour; this is a kind of cheap passive compaction.
AFAIK, a 1-order page set consists of pages with PFNs [2n, 2n+1] (even,
odd), so the neighbour's pfn is calculated via XOR with 1. We check that
the resulting pfn is valid, check its page_count(), and prefer merging
into @tree_page if the neighbour's usage count is zero.
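In code, the neighbour check described above amounts to roughly the
following (the helper name is made up for illustration):

static bool has_free_buddy(struct page *page)
{
	unsigned long buddy_pfn = page_to_pfn(page) ^ 1;

	/* a valid neighbour with zero usage count means freeing this
	 * page lets the buddy allocator assemble an order-1 block */
	return pfn_valid(buddy_pfn) &&
	       page_count(pfn_to_page(buddy_pfn)) == 0;
}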
There is a small difference from the current behavior in the error path.
In case the second try_to_merge_with_ksm_page() fails, we return from
try_to_merge_two_pages() with @tree_page removed from the unstable tree.
It does not seem to matter, but if we do not want a change at all, it's
not a problem to move remove_rmap_item_from_tree() from
try_to_merge_with_ksm_page() to its callers.
Link: http://lkml.kernel.org/r/153995241537.4096.15189862239521235797.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jia He <jia.he@hxt-semitech.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:37 +0000 (11:13 +1100)]
mm: remove reset of pcp->counter in pageset_init()
per_cpu_pageset is cleared by memset(), so it is not necessary to reset
it again.
Link: http://lkml.kernel.org/r/20181021023920.5501-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm-add-an-f_seal_future_write-seal-to-memfd-fix-2
v4
Link: http://lkml.kernel.org/r/20181122230906.GA198127@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust
A better way to implement the F_SEAL_FUTURE_WRITE seal was discussed [1]
last week, one where we don't need to modify core VFS structures to get
the same behavior of the seal. This solves several side-effects pointed
out by Andy [2].
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward to
migrating all usecases of ashmem to memfd so that we can possibly remove
the ashmem driver in the future from staging while also benefiting from
using memfd and contributing to it. Note that staging drivers are also
not ABI and generally can be removed at any time.
One of the main usecases Android has is the ability to create a region and
mmap it as writeable, then add protection against making any "future"
writes while keeping the existing already mmap'ed writeable-region active.
This allows us to implement a usecase where receivers of the shared
memory buffer can get a read-only view, while the sender continues to
write to the buffer. See CursorWindow documentation in Android for more
details:
This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active.
A better way to implement the F_SEAL_FUTURE_WRITE seal was discussed [1]
last week, one where we don't need to modify core VFS structures to get
the same behavior of the seal. This solves several side-effects pointed
out by Andy.
Self-tests are provided in a later patch to verify the expected semantics.
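A hedged userspace sketch of the intended flow (assumes a kernel with
this seal; 'size' is assumed to be defined and error handling is omitted):

int fd = memfd_create("cursor-window", MFD_ALLOW_SEALING);
ftruncate(fd, size);

/* the sender keeps a writable mapping... */
char *wr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* ...then forbids any *future* writes through the fd */
fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE);

/* from here on a PROT_WRITE/MAP_SHARED mmap or a write() on fd fails,
 * while 'wr' stays writable; receivers map read-only */
char *rd = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);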
Michal Hocko [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm, memory_hotplug: do not clear numa_node association after hot_remove
Per-cpu numa_node provides a default node for each possible cpu. The
association gets initialized during the boot when the architecture
specific code explores cpu->NUMA affinity. When the whole NUMA node is
removed, though, we clear this association.
This means that whoever calls cpu_to_node for a cpu associated with such
a node will get NUMA_NO_NODE. This is problematic for two reasons.
First, it is fragile because __alloc_pages_node would simply blow up on
an out-of-bounds access. We have encountered this when loading the kvm
module on an older kernel, but the code is basically the same in the
current Linus tree as well. alloc_vmcs_cpu could use
alloc_pages_nodemask, which would recognize NUMA_NO_NODE and use
alloc_pages_node, which would translate it to numa_mem_id, but that is
wrong as well because it would use the cpu affinity of the local CPU,
which might be quite far from the original node. It is also reasonable
to expect that cpu_to_node will provide a sane value, and there might be
many more callers like that.
The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined. We do not
want to lose that link as there is no arch independent way to get it from
the early boot time AFAICS.
Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues. The NODE_DATA(nid) is not deallocated so
it will stay in place and if anybody wants to allocate from that node then
a fallback node will be used.
Thanks to Vlastimil Babka, whose live system debugging skills helped
debug the issue.
Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node") Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Miroslav Benes <mbenes@suse.cz> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yangtao Li [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm/mmap.c: remove verify_mm_writelocked()
We should get rid of this function. It no longer serves its purpose.
This is a historical artifact from 2005 when do_brk was called outside of
the core mm. We have a proper abstraction in vm_brk_flags, and that one
does the locking properly, so there is no need to use this function.
Link: http://lkml.kernel.org/r/20181108174856.10811-1-tiny.windzz@gmail.com Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm: select HAVE_MOVE_PMD on x86 for faster mremap
Moving page-tables at the PMD-level on x86 is known to be safe. Enable
this option so that we can do fast mremap when possible.
Link: http://lkml.kernel.org/r/20181108181201.88826-4-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Michal Hocko <mhocko@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm/mremap: fix 'move_normal_pmd' unused function warning
The move_normal_pmd function may not be used on architectures that don't
enable HAVE_MOVE_PMD. This has shown to cause unused function warnings on
those architectures. Lets not define it for those cases.
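The shape of the fix is a config guard around the helper; a sketch with
the function body elided:

#ifdef CONFIG_HAVE_MOVE_PMD
static bool move_normal_pmd(struct vm_area_struct *vma,
		unsigned long old_addr, unsigned long new_addr,
		unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd)
{
	/* ... the PMD-level move from the previous patch ... */
}
#endif /* only compiled when the architecture opts in */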
Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm: speed up mremap by 20x on large regions
Android needs to mremap large regions of memory during memory-management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which copies one pte at
a time and can be really slow across a large map. Turning on THP may not
be a viable option, and is not for us. This patch speeds up the
performance for non-THP systems by copying at the PMD level when possible.
The speedup is an order of magnitude on x86 (~20x). On a 1GB mremap, the
mremap completion time drops from 3.4-3.6 milliseconds to 144-160
microseconds.
Before:
Total mremap time for 1GB data: 3521942 nanoseconds.
Total mremap time for 1GB data: 3449229 nanoseconds.
Total mremap time for 1GB data: 3488230 nanoseconds.
After:
Total mremap time for 1GB data: 150279 nanoseconds.
Total mremap time for 1GB data: 144665 nanoseconds.
Total mremap time for 1GB data: 158708 nanoseconds.
If THP is enabled the optimization is mostly skipped except in certain
situations.
Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap"/
This series speeds up the mremap(2) syscall by copying page tables at the
PMD level even for non-THP systems. There is a concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle and architecture-related in the future that may make the scheme
not work.
Also, we find that there is no point in passing the 'address' to
pte_alloc since it is unused. This patch therefore removes this argument
tree-wide, resulting in a nice negative diff as well. We also ensure
along the way that the enabled architectures do not do anything funky
with the 'address' argument that would go unnoticed by the optimization.
Build and boot tested on x86-64. Build tested on arm64. The config
enablement patch for arm64 will be posted in the future after more
testing.
The changes were obtained by applying the following Coccinelle script.
(Thanks, Julia, for answering all the Coccinelle questions!)
The following fix-ups were done manually:
* Removal of the address argument from pte_fragment_alloc
* Removal of the pte_alloc_one_fast definitions from m68k and microblaze.
// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.
virtual patch
@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@
fn(...
- , T2 E2
)
{ ... }
@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)
@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)
@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@
(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)
Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@kernel.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
As jhash2 will always be slower (for data sizes like PAGE_SIZE), don't
use it in ksm at all.
Use only xxhash for now, because for using crc32c, cryptoapi must be
initialized first - that requires some tricky solution to work well in all
situations.
Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Signed-off-by: leesioh <solee@os.korea.ac.kr> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Experiment process:
First, we turn off KSM and launch 4 VMs. Then we turn on KSM and
measure the checksum computation time until full_scans becomes two.
The experimental results (each value is the average of the measured
values):
crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns
In summary, the result shows that crc32c_intel has advantages over all
of the hash functions used in the experiment (computation time decreased
by 84.54% compared to crc32c, 78.86% compared to jhash2, 51.33% compared
to xxhash32, and 23.28% compared to xxhash64). The results are similar
to those of Timofey.
But use only xxhash for now, because to use crc32c, the cryptoapi must be
initialized first - that requires some tricky solution to work well in
all situations.
So:
- The first patch implements compile-time pickup of the fastest
  implementation of xxhash for the target platform.
- The second patch replaces jhash2 with xxhash.
This patch (of 2):
xxh32() - fast on both 32-bit and 64-bit platforms
xxh64() - fast only on 64-bit platforms
Create xxhash(), which will pick the fastest version at compile time.
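The resulting helper is tiny; roughly:

static inline unsigned long xxhash(const void *input, size_t length,
				   uint64_t seed)
{
#if BITS_PER_LONG == 64
	return xxh64(input, length, seed);	/* fast on 64-bit */
#else
	return xxh32(input, length, seed);	/* xxh64 would be slow here */
#endif
}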
Link: http://lkml.kernel.org/r/20181023182554.23464-2-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: leesioh <solee@os.korea.ac.kr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm, memory_hotplug: be more verbose for memory offline failures
There is only very limited information printed when the memory offlining
fails:
[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
This tells us that the failure is triggered by userspace intervention,
but it doesn't tell us much more about the underlying reason. It might
be that the page migration fails repeatedly and the userspace timeout
expires and sends a signal, or it might be that some of the earlier steps
(isolation, memory notifier) take too long.
If the migration fails then it would be really helpful to see which page
that is and what its state is. The same applies to the isolation phase.
If we fail to isolate a page from the allocator then knowing the state of
the page would be helpful as well.
Dump the page state that fails to get isolated or migrated. This will
tell us more about the failure and what to focus on during debugging.
Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm, memory_hotplug: print reason for the offlining failure
The memory offlining failure reporting is inconsistent and insufficient.
Some error paths simply do not report the failure to the log at all. When
we do report, there are no details about the reason for the failure, and
there are several possible reasons, which makes memory offlining failures
hard to debug.
Make sure that the
memory offlining [mem %#010llx-%#010llx] failed
message is printed for all failures and also provide a short textual
reason for the failure e.g.
[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
This tells us that the offlining failed because of a pending signal,
aka user intervention.
Link: http://lkml.kernel.org/r/20181107101830.17405-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm, memory_hotplug: drop pointless block alignment checks from __offline_pages
This function is never called from a context which would provide a
misaligned pfn range, so drop the pointless check.
Link: http://lkml.kernel.org/r/20181107101830.17405-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm: lower the printk loglevel for __dump_page messages
__dump_page messages use the KERN_EMERG or KERN_ALERT loglevel (this has
been the case since 2004). Most callers of this function are really
detecting a critical page state and BUG right after. On the other hand,
the function is also called from contexts which just want to inform about
the page state and those would rather not disrupt logs that much (e.g.
some systems route these messages to the normal console).
Reduce the loglevel to KERN_WARNING to make dump_page easier to reuse for
other contexts while those messages will still make it to the kernel log
in most setups. Even if the loglevel setup filters warnings away, those
paths that are really critical already print a more targeted error or
panic, and that should make it to the kernel log.
Link: http://lkml.kernel.org/r/20181107101830.17405-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20181125080834.GB12455@dhcp22.suse.cz Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dan Carpenter [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: debug: Fix a width vs precision bug in printk
We had intended to only print dentry->d_name.len characters but there is
a width vs precision typo so if the name isn't NUL terminated it will
read past the end of the buffer.
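To make the distinction concrete, a sketch (the format strings are
illustrative, not the exact line in __dump_page):

/* width: pads to at least len characters but keeps reading up to the
 * NUL, so an unterminated name overruns the buffer */
pr_warn("name:\"%*s\"\n", dentry->d_name.len, dentry->d_name.name);

/* precision: prints at most len characters -- what was intended */
pr_warn("name:\"%.*s\"\n", dentry->d_name.len, dentry->d_name.name);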
Link: http://lkml.kernel.org/r/20181123072135.gqvblm2vdujbvfjs@kili.mountain Fixes: 408ddbc22be3 ("mm: print more information about mapping in __dump_page") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: print more information about mapping in __dump_page
I have been promising to improve memory offlining failure debugging for
quite some time. As things stand now we get only very limited
information in the kernel log when the offlining fails. It is usually
only the failure message itself, with no further details. We do not know
what exactly fails and for what reason. Whenever I was forced to debug
such a failure I've always had to write a debugging patch to tell me
more. We can enable some tracepoints, but it would be much better to get
a better picture without using them.
This patch series does two things. The first is to make dump_page more
usable by printing more information about the mapping (patch 1). Then it
reduces the log level from emerg to warning so that this function is
usable from less critical contexts (patch 2). Then I have added more
detailed information about the offlining failure (patch 4) and finally
added dump_page to the isolation and offlining migration paths (patch 5).
Patch 3 is a trivial cleanup.
This patch (of 5):
__dump_page prints the mapping pointer, but that is quite unhelpful for
many reports because the pointer itself only helps to distinguish
anon/ksm mappings from other ones (because of the lowest bits set).
Sometimes it would be much more helpful to know what kind of mapping it
actually is, and if we know it is a file mapping, to also try to resolve
the dentry name.
Link: http://lkml.kernel.org/r/20181107101830.17405-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: ksm: do not block on page lock when searching stable tree
ksmd needs to search the stable tree to look for a suitable KSM page, but
the KSM page might be locked for a long time due to the KSM page rmap
walk.
It is not worth waiting for the lock; the page can be skipped and we can
then try to merge it in the next scan to avoid long stalls if its content
is still intact.
Introduce an async mode to get_ksm_page() to not block on the page lock,
as try_to_merge_one_page() does.
Return -EBUSY if the trylock fails, since a NULL means failure to find a
suitable KSM page, which is a valid case.
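A hedged sketch of the async path inside get_ksm_page() (the flag name
is illustrative):

if (async) {
	if (!trylock_page(page)) {
		put_page(page);
		/* -EBUSY, not NULL: the page exists but is contended,
		 * so the caller can retry it on the next scan */
		return ERR_PTR(-EBUSY);
	}
} else {
	lock_page(page);
}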
Link: http://lkml.kernel.org/r/1541618201-120667-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
ksmd found a suitable KSM page on the stable tree and is trying to lock
it, but it is locked by the direct reclaim path, which is walking the
page's rmap to get the number of referenced PTEs.
The KSM page rmap walk needs to iterate over all rmap_items of the page
and all rmap anon_vmas of each rmap_item. So it may take (# rmap_items *
# child processes) loops. This number of loops might be very large in
the worst case, and may take a long time.
Typically, direct reclaim will not intend to reclaim too many pages, and
it is latency sensitive. So it is not worth doing the long ksm page rmap
walk to reclaim just one page.
Skip KSM pages in direct reclaim if the reclaim priority is low, but still
try to reclaim KSM pages with high priority.
Link: http://lkml.kernel.org/r/1541618201-120667-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anders Roxell [Wed, 5 Dec 2018 00:13:31 +0000 (11:13 +1100)]
writeback: don't decrement wb->refcnt if !wb->bdi
This happened while running the AMBA PL011 UART driver in
qemu-system-aarch64 with CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled. Since
arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init),
devtmpfs' handle_remove() crashes: the reference count is a NULL pointer
only because wb->bdi hasn't been initialized yet.
Rework so that wb_put has an extra check for wb->bdi before decrementing
wb->refcnt, and also add a WARN_ON_ONCE to get a warning if it happens
again in other drivers.
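A sketch of the reworked helper, assuming wb->refcnt is the percpu
refcount it is upstream:

static inline void wb_put(struct bdi_writeback *wb)
{
	/* a driver bug can remove a file before its bdi is set up */
	if (WARN_ON_ONCE(!wb->bdi))
		return;

	percpu_ref_put(&wb->refcnt);
}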
Link: http://lkml.kernel.org/r/20181030113545.30999-2-anders.roxell@linaro.org Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Contrary to its name, mmu_notifier_synchronize() does not synchronize
the notifier's SRCU instance, but rather waits for RCU callbacks to
finish, i.e. it invokes rcu_barrier(). The RCU documentation is quite
clear on this matter, explicitly calling out that rcu_barrier() does not
imply synchronize_rcu().
As there are no callers of mmu_notifier_synchronize() and it's unclear
whether any user of mmu_notifier_call_srcu() will ever want to barrier on
their callbacks, simply remove the function.
Balbir Singh [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm/hotplug: optimize clear_hwpoisoned_pages()
In hot remove, we try to clear poisoned pages. A small optimization,
checking whether num_poisoned_pages is 0, lets us skip the iteration
through nr_pages entirely.
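The optimization is a single early return; sketched (num_poisoned_pages
is the global hwpoison counter):

static void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
{
	int i;

	/* nothing is poisoned system-wide: skip the per-page scan */
	if (!atomic_long_read(&num_poisoned_pages))
		return;

	for (i = 0; i < nr_pages; i++) {
		/* ... existing per-page hwpoison clearing ... */
	}
}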
Link: http://lkml.kernel.org/r/20181102120001.4526-1-bsingharora@gmail.com Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm-page_owner-clamp-read-count-to-page_size-fix
use min_t()
Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miles Chen [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm/page_owner: clamp read count to PAGE_SIZE
The (root-only) page owner read might allocate a large amount of memory
with a large read count. Allocation failures can easily occur when doing
high-order allocations.
Clamp the buffer size to PAGE_SIZE to avoid arbitrary-size allocations
and allocation failures due to high-order allocations.
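With the min_t() fix from the patch above folded in, the read path
amounts to roughly:

/* never allocate more than a page, whatever count userspace passed */
count = min_t(size_t, count, PAGE_SIZE);
kbuf = kmalloc(count, GFP_KERNEL);
if (!kbuf)
	return -ENOMEM;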
Link: http://lkml.kernel.org/r/1541091607-27402-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
include/linux/slab.h: fix sparse warning in kmalloc_type()
Multiple people have reported the following sparse warning:
./include/linux/slab.h:332:43: warning: dubious: x & !y
The minimal fix would be to change the logical & to boolean &&, which
emits the same code, but Andrew has suggested that the branch-avoiding
tricks are maybe not worthwhile. David Laight provided a nice comparison
of the disassembly of multiple variants, which shows that the current
version produces a 4-deep dependency chain, and fixing the sparse warning
by changing the logical and to a multiplication emits an IMUL, making it
even more expensive.
The code as rewritten by this patch yielded the best disassembly, with a
single predictable branch for the most common case, and a ternary operator
for the rest, which gcc seems to compile without a branch or cmov by
itself.
The result should be more readable, without a sparse warning and probably
also faster for the common case.
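The rewritten helper looks roughly like this (an approximation based on
the description above, not a verbatim copy of the patch):

static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
{
	/* the single predictable branch: neither special flag set */
	if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0))
		return KMALLOC_NORMAL;

	/* the rare cases: a ternary gcc can compile branchlessly */
	return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM;
}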
Link: http://lkml.kernel.org/r/80340595-d7c5-97b9-4f6c-23fa893a91e9@suse.cz Fixes: 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable caches") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Darryl T. Agostinelli <dagostinelli@gmail.com> Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: improve performance by skipping checked node in get_any_partial()
1. Background
Current slub has three layers:
* cpu_slab
* percpu_partial
* per node partial list
The slub allocator tries to get an object from top to bottom. When it
can't get an object from the upper two layers, it searches the per-node
partial list. This is done in get_partial().
The idea behind this is: first try the local node, then try other nodes
if the caller doesn't specify a node.
2. Room for Improvement
When we look one step deeper in get_any_partial(), it tries to get a
proper node by for_each_zone_zonelist(), which iterates on the
node_zonelists.
This behavior introduces some redundant checks on the same node,
because:
* the local node is already checked in get_partial_node()
* one node may have several zones on node_zonelists
3. Solution Proposed in Patch
We could reduce these redundant checks by recording the last
unsuccessful node and then skipping it, as sketched below.
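A hedged sketch of the loop with the skip (the 'except' variable name is
illustrative; the real iteration also performs cpuset and defrag checks):

int except = NUMA_NO_NODE;	/* last node that failed */

for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
	int nid = zone_to_nid(zone);

	if (nid == except)
		continue;	/* this node was just checked */

	object = get_partial_node(s, get_node(s, nid), c, flags);
	if (object)
		return object;

	except = nid;		/* remember the unsuccessful node */
}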
4. Tests & Result
After some tests, the result shows this may improve the system a little,
especially on a machine with only one node.
4.1 Test Description
There are two cases for two system configurations.
Test Cases:
1. counter comparison
2. kernel build test
System Configuration:
1. One node machine with 4G
2. Four node machine with 8G
4.2 Result for Test 1
Test 1: counter comparison
This is a test with a hacked kernel to record the number of times
get_any_partial() is invoked and the number of times the inner loop
iterates. By comparing the ratio of the two counters, we get to know
how many inner loops we skipped.
Here is a snip of the test patch.
---
static void *get_any_partial() {
get_partial_count++;
do {
for_each_zone_zonelist() {
get_partial_try_count++;
}
} while();
return NULL;
}
---
The result of (get_partial_count / get_partial_try_count):
Link: http://lkml.kernel.org/r/20181120033119.30013-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: record final state of slub action in deactivate_slab()
If __cmpxchg_double_slab() fails and (l != m), the current code records
transition states of the slub action.
Update the action after __cmpxchg_double_slab() succeeds to record the
final state.
Link: http://lkml.kernel.org/r/20181107013119.3816-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: page is always non-NULL in node_match()
node_match() is a static function and is only invoked in slub.c.
In all three call sites, `page' is ensured to be valid.
Link: http://lkml.kernel.org/r/20181106150245.1668-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:28 +0000 (11:13 +1100)]
mm/slub.c: remove validation on cpu_slab in __flush_cpu_slab()
cpu_slab is a per-cpu variable that is allocated for all CPUs or none.
If cpu_slab fails to be allocated, slub is not usable.
We can therefore use cpu_slab without validation in __flush_cpu_slab().
Link: http://lkml.kernel.org/r/20181103141218.22844-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Josh Hunt [Wed, 5 Dec 2018 00:13:28 +0000 (11:13 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found that with newer kernels the cdrom device started showing up in
/proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduced this change in behavior. It's not
clear to me from the commit's changelog whether this change was
intentional or not. This comment still remains: /* Don't show
non-partitionable removeable devices or empty devices */ so I've decided
to send a patch to restore the behavior of not printing unpartitionable
removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
#42: FILE: fs/ocfs2/aops.c:2155:
+ unsigned i_blkbits = inode->i_sb->s_blocksize_bits;
ERROR: code indent should use tabs where possible
#53: FILE: fs/ocfs2/aops.c:2166:
+ ^I * "pos" and "end", we need map twice to return different buffer state:$
WARNING: please, no space before tabs
#53: FILE: fs/ocfs2/aops.c:2166:
+ ^I * "pos" and "end", we need map twice to return different buffer state:$
ERROR: code indent should use tabs where possible
#54: FILE: fs/ocfs2/aops.c:2167:
+ ^I * 1. area in file size, not set NEW;$
WARNING: please, no space before tabs
#54: FILE: fs/ocfs2/aops.c:2167:
+ ^I * 1. area in file size, not set NEW;$
ERROR: code indent should use tabs where possible
#55: FILE: fs/ocfs2/aops.c:2168:
+ ^I * 2. area out file size, set NEW.$
WARNING: please, no space before tabs
#55: FILE: fs/ocfs2/aops.c:2168:
+ ^I * 2. area out file size, set NEW.$
ERROR: code indent should use tabs where possible
#56: FILE: fs/ocfs2/aops.c:2169:
+ ^I *$
WARNING: please, no space before tabs
#56: FILE: fs/ocfs2/aops.c:2169:
+ ^I *$
ERROR: code indent should use tabs where possible
#57: FILE: fs/ocfs2/aops.c:2170:
+ ^I *^I^I iblock endblk$
WARNING: please, no space before tabs
#57: FILE: fs/ocfs2/aops.c:2170:
+ ^I *^I^I iblock endblk$
ERROR: code indent should use tabs where possible
#58: FILE: fs/ocfs2/aops.c:2171:
+ ^I * |--------|---------|---------|---------$
WARNING: please, no space before tabs
#58: FILE: fs/ocfs2/aops.c:2171:
+ ^I * |--------|---------|---------|---------$
ERROR: code indent should use tabs where possible
#59: FILE: fs/ocfs2/aops.c:2172:
+ ^I * |<-------area in file------->|$
WARNING: please, no space before tabs
#59: FILE: fs/ocfs2/aops.c:2172:
+ ^I * |<-------area in file------->|$
ERROR: code indent should use tabs where possible
#60: FILE: fs/ocfs2/aops.c:2173:
+ ^I */$
WARNING: please, no space before tabs
#60: FILE: fs/ocfs2/aops.c:2173:
+ ^I */$
total: 8 errors, 9 warnings, 40 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
NOTE: Whitespace errors detected.
You may wish to use scripts/cleanpatch or scripts/cleanfile
./patches/ocfs2-clear-zero-in-unaligned-direct-io.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Jia Guo <guojia12@huawei.com> Cc: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jia Guo [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: clear zero in unaligned direct IO
The unused portion of a part-written fs-block-sized block is not set to
zero in an unaligned append direct write. This can lead to serious data
inconsistencies.
Ocfs2 manages the disk with a cluster size (for example, 1M); a partial
write in one cluster changes the cluster state from UNWRITTEN to WRITTEN,
but the VFS (dio_zero_block()) doesn't do the zeroing because the bh's
state is not set to NEW in ocfs2_dio_wr_get_block() when we write to a
WRITTEN cluster. For example, if the cluster size is 1M, the file size
is 8k and we direct-write from 14k to 15k, then 12k~14k and 15k~16k will
contain dirty data.
We have to deal with two cases:
1. The starting position of the direct write is outside the file.
2. The starting position of the direct write is located within the file.
We need to set the bh's state to NEW in the first case. In the second
case, we need to map twice, because the bh's state for the area outside
the file size should be set to NEW while that for the area within the
file should not.
Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com Signed-off-by: Jia Guo <guojia12@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: don't clear bh uptodate for block read
For a sync IO read in ocfs2_read_blocks_sync(), we first clear the bh
uptodate flag and submit the IO, then wait for the IO to finish, and
finally check whether the bh is uptodate; if not, we return an IO error.
If two sync IOs for the same bh were issued, the first IO could finish
and set the uptodate flag, but just before that flag is checked, the
second IO could come in and clear it; then ocfs2_read_blocks_sync() for
the first IO will return an IO error.
Indeed it is not necessary to clear the uptodate flag, as the IO end
handler end_buffer_read_sync() will set or clear it based on whether the
IO succeeded or failed.
The following messages were found on an nfs server, but the underlying
storage returned no error.
[4106438.567376] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2780 ERROR: read block 1238823695 failed -5
[4106438.567569] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2812 ERROR: status = -5
[4106438.567611] (nfsd,7146,3):ocfs2_test_inode_bit:2894 ERROR: get alloc slot and bit failed -5
[4106438.567643] (nfsd,7146,3):ocfs2_test_inode_bit:2932 ERROR: status = -5
[4106438.567675] (nfsd,7146,3):ocfs2_get_dentry:94 ERROR: test inode bit failed -5
The same issue exists in the non-sync read path ocfs2_read_blocks();
fix it as well.
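A sketch of the corrected submit sequence (illustrative; the key point
is that there is no clear_buffer_uptodate() before submission):

get_bh(bh);			/* extra ref, dropped by the end_io */
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, 0, bh);
wait_on_buffer(bh);
if (!buffer_uptodate(bh))	/* set/cleared from the IO result */
	status = -EIO;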
Link: http://lkml.kernel.org/r/20181121020023.3034-4-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: clear journal dirty flag after shutdown journal
The dirty flag of the journal should be cleared at the last stage of
umount. If it is cleared before jbd2_journal_destroy(), some metadata in
an uncommitted transaction could be lost due to an IO error, but since
the dirty flag of the journal was already cleared, we can't detect that
until a full fsck is run. This may cause a system panic or other
corruption.
Link: http://lkml.kernel.org/r/20181121020023.3034-3-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: fix panic due to unrecovered local alloc
mount.ocfs2 ignores the inconsistency where the journal is clean but the
local alloc is unrecovered. After mount, the local alloc is not empty,
so the cluster reservation code doesn't allocate a new local alloc
window; the reservation map is empty
(ocfs2_reservation_map.m_bitmap_len = 0), which triggered the following
panic.
This issue was reported at
https://oss.oracle.com/pipermail/ocfs2-devel/2015-May/010854.html and the
advice was to fix it during mount. But this is a very unusual
inconsistent state; usually the journal dirty flag is cleared at the last
stage of umount, only after everything else has gone right. We may need
further debugging to check that. Anyway, to avoid possible further
corruption, the mount should be aborted and fsck should be run.
Link: http://lkml.kernel.org/r/20181121020023.3034-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Larry Chen [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: improve ocfs2 Makefile
The include file path was hard-wired in the ocfs2 Makefile, which might
cause some confusion when compiling ocfs2 as an external module.
Say we compile the ocfs2 module as follows:
cp -r /kernel/tree/fs/ocfs2 /other/dir/ocfs2
cd /other/dir/ocfs2
make -C /path/to/kernel_source M=`pwd` modules
Actually, the compiler will try to find the included files in
/kernel/tree/fs/ocfs2, rather than in the directory /other/dir/ocfs2.
To fix this little bug, we introduce the variable $(src) provided by
kbuild. $(src) means the absolute path of the running kbuild file.
Link: http://lkml.kernel.org/r/20181108085546.15149-1-lchen@suse.com Signed-off-by: Larry Chen <lchen@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhong jiang [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: remove set but not used variable 'lastzero'
lastzero is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540296942-24533-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhong jiang [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: dlmfs: remove set but not used variable 'status'
status is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540300179-26697-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jia Guo [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: optimize the reading of heartbeat data
Reading heartbeat data from the lowest node rather than from zero, in
cases where the node numbers do not start from zero, can reduce the
number of sectors read.
Here is simple test data obtained with 'iostat -dmx dm-5 2', with two
nodes in the cluster, node numbers 10 and 20, respectively.
Qian Cai [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
debugobjects: call debug_objects_mem_init earlier
The current value of the early boot static pool size, 1024, is not big
enough for systems with a large number of CPUs with timer and/or
workqueue objects selected. As a result, systems with 60+ CPUs and both
timer and workqueue objects enabled could trigger "ODEBUG: Out of memory.
ODEBUG disabled".
Some debug objects are allocated during early boot. Enabling some
options like timers or workqueue objects may increase the size required
significantly with a large number of CPUs. For example,
plus No. CPUs x 6 (workqueue) objects:
workqueue_init_early
alloc_workqueue
__alloc_workqueue_key
alloc_and_link_pwqs
init_pwq
Also, plus No. CPUs objects:
perf_event_init
__init_srcu_struct
init_srcu_struct_fields
init_srcu_struct_nodes
__init_work
However, none of these things are actually used or required before
debug_objects_mem_init() is invoked, so just move the call right before
vmalloc_init().
According to tglx, "the reason why the call is at this place in
start_kernel() is historical. It's because back in the days when
debugobjects were added the memory allocator was enabled way later than
today."
Link: http://lkml.kernel.org/r/20181126102407.1836-1-cai@gmx.us Signed-off-by: Qian Cai <cai@gmx.us> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
arch/sh/boards/mach-kfr2r09/setup.c does not need to #include
<mtd/onenand.h>, and doing so causes a build warning, so drop that header
file.
In file included from ../arch/sh/boards/mach-kfr2r09/setup.c:28:
../include/linux/mtd/onenand.h:225:12: warning: 'struct mtd_oob_ops' declared inside parameter list will not be visible outside of this definition or declaration
struct mtd_oob_ops *ops);
Link: http://lkml.kernel.org/r/702f0a25-c63e-6912-4640-6ab0f00afbc7@infradead.org Fixes: f3590dc32974 ("media: arch: sh: kfr2r09: Use new renesas-ceu camera driver") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Suggested-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Jacopo Mondi <jacopo+renesas@jmondi.org> Cc: Magnus Damm <magnus.damm@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:24 +0000 (11:13 +1100)]
kasan, mm, arm64: tag non slab memory allocated via pagealloc
Tag-based KASAN doesn't check memory accesses through pointers tagged with
0xff. When page_address is used to get a pointer to memory that corresponds
to some page, the tag of the resulting pointer gets set to 0xff, even
though the allocated memory might have been tagged differently.
For slab pages it's impossible to recover the correct tag to return from
page_address, since the page might contain multiple slab objects tagged
with different values, and we can't know in advance which one of them is
going to get accessed. For non-slab pages, however, we can recover the tag
in page_address, since the whole page was marked with the same tag.
This patch adds tagging to non-slab memory allocated with pagealloc. To
set the tag of the pointer returned from page_address, the tag gets stored
in page->flags when the memory gets allocated.
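For illustration, a minimal sketch of keeping the tag in page->flags,
assuming an 8-bit tag field reserved at KASAN_TAG_PGSHIFT by the page
flags layout (the exact layout is config dependent):
  #define KASAN_TAG_WIDTH 8
  #define KASAN_TAG_MASK  ((1UL << KASAN_TAG_WIDTH) - 1)
  static inline u8 page_kasan_tag(const struct page *page)
  {
          return (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK;
  }
  static inline void page_kasan_tag_set(struct page *page, u8 tag)
  {
          page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT);
          page->flags |= (unsigned long)tag << KASAN_TAG_PGSHIFT;
  }
page_address() can then apply page_kasan_tag(page) to the top byte of the
pointer it returns for non-slab pages.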
Link: http://lkml.kernel.org/r/6c1004acf28880f6a5cc7d2f974ba08adb2853ea.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:24 +0000 (11:13 +1100)]
kasan, arm64: add brk handler for inline instrumentation
Tag-based KASAN inline instrumentation mode (which embeds checks of shadow
memory into the generated code, instead of inserting a callback) generates
a brk instruction when a tag mismatch is detected.
This commit adds a tag-based KASAN specific brk handler that decodes the
immediate value passed to the brk instruction (to extract information
about the memory access that triggered the mismatch), reads the register
values (x0 contains the guilty address) and reports the bug.
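A hedged sketch of such a handler, assuming the immediate encoding used in
this series (recover and write bits plus a size field in the low bits):
  #define KASAN_ESR_RECOVER   0x20
  #define KASAN_ESR_WRITE     0x10
  #define KASAN_ESR_SIZE_MASK 0x0f
  #define KASAN_ESR_SIZE(esr) (1 << ((esr) & KASAN_ESR_SIZE_MASK))
  static int kasan_handler(struct pt_regs *regs, unsigned int esr)
  {
          bool recover = esr & KASAN_ESR_RECOVER;
          bool write = esr & KASAN_ESR_WRITE;
          size_t size = KASAN_ESR_SIZE(esr);
          u64 addr = regs->regs[0]; /* x0 contains the guilty address */
          kasan_report(addr, size, write, regs->pc);
          /* only resume if the compiler marked the check recoverable */
          if (!recover)
                  die("Oops - KASAN", regs, 0);
          /* skip over the brk instruction and continue execution */
          arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
          return DBG_HOOK_HANDLED;
  }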
Link: http://lkml.kernel.org/r/e825441eda1dbbbb7f583f826a66c94e6f88316a.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:24 +0000 (11:13 +1100)]
kasan: add hooks implementation for tag-based mode
This commit adds the tag-based KASAN specific hook implementations and
adjusts the hooks common to generic and tag-based KASAN.
1. When a new slab cache is created, tag-based KASAN rounds up the size of
the objects in this cache to KASAN_SHADOW_SCALE_SIZE (== 16).
2. On each kmalloc tag-based KASAN generates a random tag, sets the shadow
memory that corresponds to this object to this tag, and embeds this
tag value into the top byte of the returned pointer.
3. On each kfree tag-based KASAN poisons the shadow memory with a random
tag to allow detection of use-after-free bugs.
The rest of the logic of the hook implementation is very much similar to
the one provided by generic KASAN. Tag-based KASAN saves allocation and
free stack metadata to the slab object the same way generic KASAN does.
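As an illustration of point 2 above, a hedged sketch of the kmalloc-side
hook, simplified to always use a random tag (the real hook also has to
handle preassigned tags for caches with constructors, covered later in
this series):
  void *kasan_kmalloc(struct kmem_cache *cache, const void *object,
                      size_t size, gfp_t flags)
  {
          unsigned long redzone_start, redzone_end;
          u8 tag;
          if (unlikely(object == NULL))
                  return NULL;
          redzone_start = round_up((unsigned long)(object + size),
                                   KASAN_SHADOW_SCALE_SIZE);
          redzone_end = round_up((unsigned long)object + cache->object_size,
                                 KASAN_SHADOW_SCALE_SIZE);
          tag = random_tag();
          /* mark the object's shadow with the tag, the rest as redzone */
          kasan_unpoison_shadow(set_tag(object, tag), size);
          kasan_poison_shadow((void *)redzone_start,
                              redzone_end - redzone_start,
                              KASAN_KMALLOC_REDZONE);
          /* embed the same tag into the returned pointer's top byte */
          return set_tag(object, tag);
  }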
Link: http://lkml.kernel.org/r/b10d44bace6a7e9279b9b5a5b4c2a9c4c58cbf4f.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
mm: move obj_to_index to include/linux/slab_def.h
While with SLUB we can actually preassign tags for caches with constructors
and store them in pointers in the freelist, SLAB doesn't allow that since
the freelist is stored as an array of indexes, so there are no pointers to
store the tags.
Instead we compute the tag twice, once when a slab is created before
calling the constructor and then again each time when an object is
allocated with kmalloc. The tag is computed simply by taking the lowest
byte of the index that corresponds to the object. However, in
kasan_kmalloc we only have access to the object's pointer, so we need a
way to find out which index this object corresponds to.
This patch moves obj_to_index from slab.c to include/linux/slab_def.h to
be reused by KASAN.
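For reference, the moved helper looks roughly like this (as defined for
SLAB, using the precomputed reciprocal of the buffer size to avoid a
division):
  static inline unsigned int obj_to_index(const struct kmem_cache *cache,
                                          const struct page *page, void *obj)
  {
          u32 offset = (obj - page->s_mem);
          return reciprocal_divide(offset, cache->reciprocal_buffer_size);
  }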
Link: http://lkml.kernel.org/r/b68796c554fba66d5285274ea6356e642e18a9e5.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan: add bug reporting routines for tag-based mode
This commit adds routines that print tag-based KASAN error reports.
These are quite similar to the generic KASAN ones; the differences are:
1. The way tag-based KASAN finds the first bad shadow cell (with a
mismatching tag). Tag-based KASAN compares memory tags from the shadow
memory to the pointer tag.
2. Tag-based KASAN reports all bugs with the "KASAN: invalid-access"
header.
Also simplify generic KASAN find_first_bad_addr.
Link: http://lkml.kernel.org/r/996c09ec2c8f11294c106973f3b1a211417fa74e.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan: split out generic_report.c from report.c
This patch moves generic KASAN specific error reporting routines to
generic_report.c without any functional changes, leaving common error
reporting code in report.c to be later reused by tag-based KASAN.
Link: http://lkml.kernel.org/r/9030fe246a786be1348f8b08089f30e52be23ec4.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan, mm: perform untagged pointers comparison in krealloc
The krealloc function checks whether the same buffer was reused or a new
one allocated by comparing kernel pointers. Tag-based KASAN changes the
memory tag on the krealloc'ed chunk of memory and therefore also changes
the tag of the returned pointer. Therefore we need to perform the
comparison on
untagged (with tags reset) pointers to check whether it's the same memory
region or not.
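A sketch of the fixed check, assuming the kasan_reset_tag() helper added
earlier in this series (a no-op for generic KASAN):
  void *krealloc(const void *p, size_t new_size, gfp_t flags)
  {
          void *ret;
          if (unlikely(!new_size)) {
                  kfree(p);
                  return ZERO_SIZE_PTR;
          }
          ret = __do_krealloc(p, new_size, flags);
          /* compare untagged pointers: the tag differs even when the
             same memory region was reused */
          if (ret && kasan_reset_tag(p) != kasan_reset_tag(ret))
                  kfree(p);
          return ret;
  }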
Link: http://lkml.kernel.org/r/5045db8a8e249a1eda3199b952120035eacb3bd4.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan, arm64: enable top byte ignore for the kernel
Tag-based KASAN uses the Top Byte Ignore feature of arm64 CPUs to store a
pointer tag in the top byte of each pointer. This commit enables the
TCR_TBI1 bit, which enables Top Byte Ignore for the kernel, when tag-based
KASAN is used.
Link: http://lkml.kernel.org/r/1ed03d53ee679cba52ba7118d2acbef948d21fcc.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan, arm64: fix up fault handling logic
Right now arm64 fault handling code removes pointer tags from addresses
covered by TTBR0 in faults taken from both EL0 and EL1, but doesn't do
that for pointers covered by TTBR1.
This patch adds two helper functions is_ttbr0_addr() and is_ttbr1_addr(),
where the latter one accounts for the fact that TTBR1 pointers might be
tagged when tag-based KASAN is in use, and uses these helper functions to
perform pointer checks in arch/arm64/mm/fault.c.
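The helpers are roughly as follows (VA_START marks the start of the TTBR1
range, and arch_kasan_reset_tag() is a no-op without CONFIG_KASAN_SW_TAGS):
  static inline bool is_ttbr0_addr(unsigned long addr)
  {
          /* entry assembly clears tags for TTBR0 addrs */
          return addr < TASK_SIZE;
  }
  static inline bool is_ttbr1_addr(unsigned long addr)
  {
          /* TTBR1 addresses may have a tag if KASAN_SW_TAGS is in use */
          return arch_kasan_reset_tag(addr) >= VA_START;
  }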
Link: http://lkml.kernel.org/r/a54fe8c8c11948b0ac8c8b285fb36f845217c84a.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan: preassign tags to objects with ctors or SLAB_TYPESAFE_BY_RCU
An object constructor can initialize pointers within the object based on
the address of the object. Since the object address might be tagged, we
need to assign a tag before calling the constructor.
The implemented approach is to assign tags to objects with constructors
when a slab is allocated and call constructors once as usual. The
downside is that such objects would always have the same tag when
reallocated, so we won't catch use-after-frees on them.
Also preassign tags for objects from SLAB_TYPESAFE_BY_RCU caches, since
they can be validly accessed after having been freed.
Link: http://lkml.kernel.org/r/b2c17b6674f1737f981ffa6dca7fdfc059a88435.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan, arm64: untag address in _virt_addr_is_linear
virt_addr_is_linear (which is used by virt_addr_valid) assumes that the
top byte of the address is 0xff, which isn't always the case with
tag-based KASAN.
This patch resets the tag in this macro.
Link: http://lkml.kernel.org/r/dd9cda296c70ca6b1839cf4de3ee3137cf5030e7.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan: add tag related helper functions
This commit adds a few helper functions that are meant to be used to
work with tags embedded in the top byte of kernel pointers: to set, to
get or to reset the top byte.
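A hedged sketch of what such helpers can look like, assuming the tag lives
in bits 56-63 of a 64-bit pointer and 0xff is the native tag:
  #define KASAN_TAG_KERNEL 0xFF /* native kernel pointers tag */
  #define KASAN_TAG_SHIFT  56   /* tag occupies the top byte */
  static inline void *set_tag(const void *addr, u8 tag)
  {
          return (void *)(((u64)addr & ~(0xFFUL << KASAN_TAG_SHIFT)) |
                          ((u64)tag << KASAN_TAG_SHIFT));
  }
  static inline u8 get_tag(const void *addr)
  {
          return (u8)((u64)addr >> KASAN_TAG_SHIFT);
  }
  static inline void *reset_tag(const void *addr)
  {
          return set_tag(addr, KASAN_TAG_KERNEL);
  }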
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
arm64: move untagged_addr macro from uaccess.h to memory.h
Move the untagged_addr() macro from arch/arm64/include/asm/uaccess.h
to arch/arm64/include/asm/memory.h to be later reused by KASAN.
Also make the untagged_addr() macro accept all kinds of address types
(void *, unsigned long, etc.), so that type casts don't have to be
specified at each place where the macro is used. This is done by using
__typeof__.
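With __typeof__ the macro can return the same type it was given; a sketch
of that shape:
  /* sign-extend from bit 55: the top byte becomes 0x00 for user
     addresses and 0xff for kernel addresses, removing any tag */
  #define untagged_addr(addr) \
          ((__typeof__(addr))sign_extend64((u64)(addr), 55))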
Andrey Konovalov [Wed, 5 Dec 2018 00:13:21 +0000 (11:13 +1100)]
kasan: initialize shadow to 0xff for tag-based mode
A tag-based KASAN shadow memory cell contains a memory tag that
corresponds to the tag in the top byte of the pointer that points to that
memory. The native top byte value of kernel pointers is 0xff, so with
tag-based KASAN we need to initialize shadow memory to 0xff.
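A sketch of how the mode-dependent fill value can be expressed (this
series names it KASAN_SHADOW_INIT):
  #ifdef CONFIG_KASAN_GENERIC
  #define KASAN_SHADOW_INIT 0    /* generic mode: 0 means accessible */
  #else
  #define KASAN_SHADOW_INIT 0xFF /* tag-based mode: native, match-all tag */
  #endif
  /* e.g. when setting up the early shadow page: */
  memset(kasan_zero_page, KASAN_SHADOW_INIT, PAGE_SIZE);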
Link: http://lkml.kernel.org/r/9004fd16d56d8772775cf671a8fa66e54ed138dd.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:21 +0000 (11:13 +1100)]
kasan: rename kasan_zero_page to kasan_early_shadow_page
With the tag-based KASAN mode the early shadow value is 0xff and not 0x00,
so this patch renames kasan_zero_(page|pte|pmd|pud|p4d) to
kasan_early_shadow_(page|pte|pmd|pud|p4d) to avoid confusion.
Link: http://lkml.kernel.org/r/1aa5a8e562380e0bfb1d58c8ec3130436cff8b47.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:21 +0000 (11:13 +1100)]
kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS
This commit splits the current CONFIG_KASAN config option into two:
1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one
that exists now);
2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode.
The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have
another hardware tag-based KASAN mode, that will rely on hardware memory
tagging support in arm64.
With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to
instrument kernel files with -fsanitize=kernel-hwaddress (except the ones
for which KASAN_SANITIZE := n is set).
Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both
CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes.
This commit also adds an empty placeholder (for now) implementation of
tag-based KASAN specific hooks inserted by the compiler and adjusts
common hooks implementation.
While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option
is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will
enable once all the infrastructure code has been added.
Link: http://lkml.kernel.org/r/20728567aae93b5eb88a6636c94c1af73db7cdbc.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:20 +0000 (11:13 +1100)]
kasan: rename source files to reflect the new naming scheme
We now have two KASAN modes: generic KASAN and tag-based KASAN. Rename
kasan.c to generic.c to reflect that. Also rename kasan_init.c to init.c
as it contains initialization code for both KASAN modes.
Link: http://lkml.kernel.org/r/a6cd5acdc334d532754279a7d2d770e2be96419d.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:20 +0000 (11:13 +1100)]
kasan, slub: handle pointer tags in early_kmem_cache_node_alloc()
The previous patch updated KASAN hooks signatures and their usage in SLAB
and SLUB code, except for the early_kmem_cache_node_alloc function. This
patch handles that function separately, as it requires reordering some of
the initialization code to correctly propagate a tagged pointer in case a
tag is assigned by kasan_kmalloc.
Link: http://lkml.kernel.org/r/3459b9f5d4daf96668d9579da2cf15eb5523284b.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:20 +0000 (11:13 +1100)]
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v12.
This patchset adds a new software tag-based mode to KASAN [1].
(Initially this mode was called KHWASAN, but it got renamed,
see the naming rationale at the end of this section).
The plan is to implement HWASan [2] for the kernel, with the expectation
that it will have performance comparable to KASAN while at the same time
consuming much less memory, trading that off for somewhat imprecise bug
detection and arm64-only support.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error (a sketch of such a check is
given below).
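As an illustration of points 3-5, a hedged sketch of the out-of-line check
that the compiler instrumentation calls before an access (names follow
this series loosely; 0xff is the native, unchecked tag):
  void check_memory_region(unsigned long addr, size_t size, bool write,
                           unsigned long ret_ip)
  {
          u8 *shadow_first, *shadow_last, *shadow;
          void *untagged = reset_tag((const void *)addr);
          u8 tag = get_tag((const void *)addr);
          if (unlikely(size == 0))
                  return;
          /* accesses through 0xff-tagged pointers are not checked */
          if (tag == 0xFF)
                  return;
          shadow_first = kasan_mem_to_shadow(untagged);
          shadow_last = kasan_mem_to_shadow(untagged + size - 1);
          for (shadow = shadow_first; shadow <= shadow_last; shadow++) {
                  if (*shadow != tag) {
                          kasan_report(addr, size, write, ret_ip);
                          return;
                  }
          }
  }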
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
such pointers to be dereferenced. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which would
use hardware memory tagging support (announced by Arm [3]) instead of
compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
On mobile devices generic KASAN's memory usage is a significant problem. One
of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes:
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life miserable.
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to the shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has its own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses that fall into the
same shadow cell as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that, there's a particular type of bug that tag-based KASAN can
detect that generic KASAN cannot: a use-after-free after the object has
been reallocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern, deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions were instrumented in an LLVM compiler
pass, and a kernel module was used that would print a bug report whenever
two pointers with different tags are compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type). The kernel was then booted in QEMU and on
an Odroid C2 board, and syzkaller was run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure, debug checks
have been added to those two functions to verify that the result doesn't
change whether we operate on pointers with or without untagging.
A few other cases that don't look that interesting:
Comparing pointers to achieve a unique sorting order of pointee objects
(e.g. sorting lock addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups in fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing approach (use until it
fails) is the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on an Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole of which is reserved once during boot).
3. Quarantine (grows gradually up to some preset limit; the larger the
limit, the greater the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/10068968c5dbdd1913bd60f89262e3c1a50f3d38.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
VM_DATA_DEFAULT_FLAGS uses READ_IMPLIES_EXEC, so page.h should include
personality.h to provide this.
This fixes no known bugs and can be safely ignored ;)
Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Piotr Jaroszynski [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
fs/iomap.c: get/put the page in iomap_page_create/release()
migrate_page_move_mapping() expects pages with private data set to have a
page_count elevated by 1. This is what used to happen for xfs through the
buffer_heads code before the switch to iomap in commit 82cb14175e7d ("xfs:
add support for sub-pagesize writeback without buffer_heads"). Not having
the count elevated causes move_pages() to fail on memory mapped files
coming from xfs.
Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it in
iomap_page_release().
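A sketch of the resulting pairing, close to the 4.20-era fs/iomap.c (the
uptodate bitmap setup is incidental here):
  static struct iomap_page *
  iomap_page_create(struct inode *inode, struct page *page)
  {
          struct iomap_page *iop = to_iomap_page(page);
          if (iop || i_blocksize(inode) == PAGE_SIZE)
                  return iop;
          iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
          atomic_set(&iop->read_count, 0);
          atomic_set(&iop->write_count, 0);
          bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
          /* migrate_page_move_mapping() assumes that pages with private
             data have their count elevated by 1 */
          get_page(page);
          set_page_private(page, (unsigned long)iop);
          SetPagePrivate(page);
          return iop;
  }
  static void
  iomap_page_release(struct page *page)
  {
          struct iomap_page *iop = to_iomap_page(page);
          if (!iop)
                  return;
          ClearPagePrivate(page);
          set_page_private(page, 0);
          put_page(page);
          kfree(iop);
  }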
Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads") Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Brian Foster <bfoster@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yongkai Wu [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
hugetlbfs: call VM_BUG_ON_PAGE earlier in free_huge_page()
A stack trace was triggered by VM_BUG_ON_PAGE(page_mapcount(page), page)
in free_huge_page(). Unfortunately, the page->mapping field was set to
NULL before this test. This made it more difficult to determine the root
cause of the problem.
Move the VM_BUG_ON_PAGE tests earlier in the function so that if they do
trigger, more information is present in the page struct.
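A hedged sketch of the reordering (only the relevant lines of
free_huge_page() are shown):
  void free_huge_page(struct page *page)
  {
          /* check while page->mapping is still intact, so a triggered
             BUG dumps useful state from the page struct */
          VM_BUG_ON_PAGE(page_count(page), page);
          VM_BUG_ON_PAGE(page_mapcount(page), page);
          set_page_private(page, 0);
          page->mapping = NULL;
          /* ... the rest of the teardown is unchanged ... */
  }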
Link: http://lkml.kernel.org/r/1543491843-23438-1-git-send-email-nic_w@163.com Signed-off-by: Yongkai Wu <nic_w@163.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yueyi Li [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
memblock: annotate memblock_is_reserved() with __init_memblock
Found warning:
WARNING: EXPORT symbol "gsi_write_channel_scratch" [vmlinux] version generation failed, symbol will not be versioned.
WARNING: vmlinux.o(.text+0x1e0a0): Section mismatch in reference from the function valid_phys_addr_range() to the function .init.text:memblock_is_reserved()
The function valid_phys_addr_range() references
the function __init memblock_is_reserved().
This is often because valid_phys_addr_range lacks a __init
annotation or the annotation of memblock_is_reserved is wrong.
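The fix is a one-line annotation; the function then reads roughly as
follows (__init_memblock expands to __meminit, or to nothing, depending on
CONFIG_ARCH_DISCARD_MEMBLOCK, which keeps the section references
consistent):
  bool __init_memblock memblock_is_reserved(phys_addr_t addr)
  {
          return memblock_search(&memblock.reserved, addr) != -1;
  }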
Baruch Siach [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
psi: fix reference to kernel commandline enable
The kernel commandline parameter named in CONFIG_PSI_DEFAULT_DISABLED
help text contradicts the documentation in kernel-parameters.txt, and
the code. Fix that.
Link: http://lkml.kernel.org/r/20181203213416.GA12627@cmpxchg.org Fixes: e0c274472d ("psi: make disabling/enabling easier for vendor kernels") Signed-off-by: Baruch Siach <baruch@tkos.co.il> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jian Wang [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
ocfs2/dlm: return DLM_CANCELGRANT if the lock is on granted list and the operation is canceled
In dlm_move_lockres_to_recovery_list(), if the lock is in the granted
queue and cancel_pending is set, it will encounter a BUG. I think this is
a meaningless BUG, so remove it. A scenario that causes
this BUG will be given below.
At the beginning, Node 1 is the master and holds an NL lock, while Node 2
and Node 3 each hold a PR lock. The sequence of events is as follows:
1. Node 2 wants to get an EX lock.
2. Node 3 wants to get an EX lock.
3. Node 1: Node 3's PR lock hinders Node 2 from getting the EX lock, so
Node 1 sends Node 3 a BAST.
4. Node 3: receives the BAST from Node 1; the downconvert thread begins to
cancel the PR to EX conversion. In the dlmunlock_common function, the
downconvert thread has set lock->cancel_pending, but has not yet entered
the dlm_send_remote_unlock_request function.
5. Node 2 dies because the host is powered down.
6. Node 1: in the recovery process, cleans the locks related to Node 2,
then finishes Node 3's PR to EX request and gives Node 3 an AST.
7. Node 3: receives the AST from Node 1, changes the lock level to EX and
moves the lock to the granted list.
8. Node 1 dies because the host is powered down.
9. Node 3: in the dlm_move_lockres_to_recovery_list function, the lock is
in the granted queue and cancel_pending is set: BUG_ON.
But after clearing this BUG, the process will encounter a second BUG in
the ocfs2_unlock_ast function. Here is a scenario that will cause that
second BUG:
At the beginning, Node 1 is the master and holds an NL lock, while Node 2
and Node 3 each hold a PR lock. The sequence of events is as follows:
1. Node 2 wants to get an EX lock.
2. Node 3 wants to get an EX lock.
3. Node 1: Node 3's PR lock hinders Node 2 from getting the EX lock, so
Node 1 sends Node 3 a BAST.
4. Node 3: receives the BAST from Node 1; the downconvert thread begins to
cancel the PR to EX conversion. In the dlmunlock_common function, the
downconvert thread has released lock->spinlock and res->spinlock, but has
not yet entered the dlm_send_remote_unlock_request function.
5. Node 2 dies because the host is powered down.
6. Node 1: in the recovery process, cleans the locks related to Node 2,
then finishes Node 3's PR to EX request and gives Node 3 an AST.
7. Node 3: receives the AST from Node 1, changes the lock level to EX,
moves the lock to the granted list and sets lockres->l_unlock_action to
OCFS2_UNLOCK_INVALID in the ocfs2_locking_ast function.
8. Node 1 dies because the host is powered down.
9. Node 3: realizes that Node 1 is dead and removes Node 1 from the
domain_map. The downconvert thread gets DLM_NORMAL from the
dlm_send_remote_unlock_request function and sets *call_ast to 1. Then the
downconvert thread meets the BUG in the ocfs2_unlock_ast function.
To avoid meet the second BUG, dlmunlock_common() should return
DLM_CANCELGRANT if the lock is on granted list and the operation is
canceled.
Link: http://lkml.kernel.org/r/98f0e80c-9c13-dbb6-047c-b40e100082b1@huawei.com Signed-off-by: Jian Wang <wangjian161@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jian Wang [Wed, 5 Dec 2018 00:13:18 +0000 (11:13 +1100)]
ocfs2/dlm: clean DLM_LKSB_GET_LVB and DLM_LKSB_PUT_LVB when the cancel_pending is set
dlm_move_lockres_to_recovery_list() should clear DLM_LKSB_GET_LVB and
DLM_LKSB_PUT_LVB when cancel_pending is set. Otherwise a node may panic
in dlm_proxy_ast_handler.
Here is the situation. At the beginning, Node 1 is the master of the lock
resource and holds an NL lock, Node 2 holds a PR lock, Node 3 holds a PR
lock and Node 4 holds an NL lock. The sequence of events is as follows:
1. Node 2: converts lock_2 from PR to EX.
2. Node 1: the mode of lock_3 is PR, which blocks the conversion request
of Node 2; moves lock_2 to the conversion list.
3. Node 3: converts lock_3 from PR to EX.
4. Node 1: moves lock_3 to the conversion list and sends a BAST to Node 3.
5. Node 3: receives the BAST from Node 1; the downconvert thread executes
the cancel convert operation.
6. Node 1 dies because the host is powered down.
7. Node 3: in the dlmunlock_common function, the downconvert thread sets
cancel_pending. At the same time, Node 3 realizes that Node 1 is dead, so
it moves lock_3 back to the granted list in the
dlm_move_lockres_to_recovery_list function and removes Node 1 from the
domain_map in the __dlm_hb_node_down function. The downconvert thread then
fails to send the lock cancellation request to Node 1 and returns
DLM_NORMAL from the dlm_send_remote_unlock_request function.
8. Node 4: becomes the recovery master.
9. Node 2: during the recovery process, sends lock_2, which is converting
from PR to EX, to Node 4.
10. Node 3: during the recovery process, sends lock_3, which is on the
granted list and contains the DLM_LKSB_GET_LVB flag, to Node 4. Then the
downconvert thread deletes the DLM_LKSB_GET_LVB flag in the
dlmunlock_common function.
11. Node 4: finishes recovery. The mode of lock_3 is PR, which blocks the
conversion request of Node 2, so Node 4 sends a BAST to Node 3.
12. Node 3: receives the BAST from Node 4 and converts lock_3 from PR to
NL.
13. Node 4: changes the mode of lock_3 from PR to NL and sends a message
to Node 3.
14. Node 3: receives the message from Node 4. The message contains the
LKM_GET_LVB flag, but lock->lksb->flags does not contain DLM_LKSB_GET_LVB:
BUG_ON in the dlm_proxy_ast_handler function.
Link: http://lkml.kernel.org/r/bffbe5a5-acb6-652d-eb8a-99fb051e6631@huawei.com Signed-off-by: Jian Wang <wangjian161@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Kravetz [Wed, 5 Dec 2018 00:13:18 +0000 (11:13 +1100)]
hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in truncation
and hole punch code to cover the call to remove_inode_hugepages.
Link: http://lkml.kernel.org/r/20181203200850.6460-3-mike.kravetz@oracle.com Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Kravetz [Wed, 5 Dec 2018 00:13:18 +0000 (11:13 +1100)]
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for better synchronization".
These patches are a follow up to the RFC,
http://lkml.kernel.org/r/20181024045053.1467-1-mike.kravetz@oracle.com
Comments made by Naoya were addressed.
There are two primary issues addressed here:
1) For shared pmds, huge pte pointers returned by huge_pte_alloc can
become invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid
global reserve counts and state.
Both issues are addressed by expanding the use of i_mmap_rwsem.
These issues have existed for a long time. They can be recreated with a
test program that causes page fault/truncation races. For simple
mappings, this results in a negative HugePages_Rsvd count. If racing with
mappings that contain shared pmds, we can hit "BUG at
fs/hugetlbfs/inode.c:444!" or Oops! as the result of an invalid memory
reference.
I broke up the larger RFC into separate patches addressing each issue.
Hopefully, this is easier to understand/review.
This patch (of 3):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called
(a sketch of the resulting locking pattern is given below).
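A hedged sketch of that pattern (fault-path details elided; h is the
hstate of the vma):
  /* page fault path: hold i_mmap_rwsem in read mode across the ptep use */
  mapping = vma->vm_file->f_mapping;
  i_mmap_lock_read(mapping);
  ptep = huge_pte_alloc(mm, address, huge_page_size(h));
  if (!ptep) {
          i_mmap_unlock_read(mapping);
          return VM_FAULT_OOM;
  }
  /* ... handle the fault using ptep ... */
  i_mmap_unlock_read(mapping);
  /* truncation / unmap path: take the semaphore in write mode instead */
  i_mmap_lock_write(mapping);
  /* ... unmapping and huge_pmd_unshare() run under the write lock ... */
  i_mmap_unlock_write(mapping);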
Link: http://lkml.kernel.org/r/20181203200850.6460-2-mike.kravetz@oracle.com Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Rientjes [Wed, 5 Dec 2018 00:13:17 +0000 (11:13 +1100)]
mm, thp: always specify disabled vmas as nh in smaps
1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") introduced
a regression in that userspace cannot always determine the set of vmas
where thp is disabled.
Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
to determine if a vma has been disabled from being backed by hugepages.
Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
be disabled and emit "nh" as a flag for the corresponding vmas as part of
/proc/pid/smaps. After the commit, thp is disabled by means of an mm flag
and "nh" is not emitted.
This causes smaps parsing libraries to assume a vma is enabled for thp,
and ends up puzzling the user as to why its memory is not backed by thp.
This also clears the "hg" flag to make the behavior of MADV_HUGEPAGE and
PR_SET_THP_DISABLE definitive.
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1809251449060.96762@chino.kir.corp.google.com Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") Signed-off-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>