Wei Yang [Wed, 5 Dec 2018 00:13:39 +0000 (11:13 +1100)]
vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n
fa5e084e43eb ("vmscan: do not unconditionally treat zones that fail
zone_reclaim() as full") changed the return value of node_reclaim(). The
original return value 0 means NODE_RECLAIM_SOME after this commit.
While the return value of node_reclaim() when CONFIG_NUMA is n is not
changed. This will leads to call zone_watermark_ok() again.
This patch fixes the return value by adjusting it to NODE_RECLAIM_NOSCAN.
Since node_reclaim() is only called in page_alloc.c, move it to
mm/internal.h.
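For reference, a minimal sketch of the !CONFIG_NUMA stub this implies in
mm/internal.h (an illustration of the described change, not the exact hunk):

#ifdef CONFIG_NUMA
extern int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
			unsigned int order);
#else
static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
			       unsigned int order)
{
	/* nothing was scanned, so the caller must not re-check the
	 * watermark as if some pages had been reclaimed */
	return NODE_RECLAIM_NOSCAN;
}
#endif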
Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:38 +0000 (11:13 +1100)]
mm: remove managed_page_count_lock spinlock
Now that totalram_pages and managed_pages are atomic variables, there is
no need for the managed_page_count spinlock. The lock provided only a
weak consistency guarantee anyway: it was never used for anything but the
updates, and no reader actually cares about all the values being updated
in sync.
Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:38 +0000 (11:13 +1100)]
mm: convert totalram_pages and totalhigh_pages variables to atomic
totalram_pages and totalhigh_pages are made static inline functions.
The main motivation was that managed_page_count_lock handling was
complicating things. It was discussed at length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
better to remove the lock and convert the variables to atomic, with
preventing potential store-to-read tearing as a bonus.
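For illustration, the accessor pattern this describes looks roughly like
the following sketch (the underscore-prefixed name is how the raw atomic
is hidden behind the inline function):

extern atomic_long_t _totalram_pages;

static inline unsigned long totalram_pages(void)
{
	/* a single atomic load; readers cannot observe a torn value */
	return (unsigned long)atomic_long_read(&_totalram_pages);
}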
Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:38 +0000 (11:13 +1100)]
mm: convert zone->managed_pages to atomic variable
totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.
This patch converts zone->managed_pages. Subsequent patches will convert
totalram_pages and totalhigh_pages, and eventually
managed_page_count_lock will be removed.
The main motivation was that managed_page_count_lock handling was
complicating things. It was discussed at length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
better to remove the lock and convert the variables to atomic, with
preventing potential store-to-read tearing as a bonus.
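A sketch of the per-zone side of the conversion, assuming the accessor
name used by the series:

static inline unsigned long zone_managed_pages(struct zone *zone)
{
	return (unsigned long)atomic_long_read(&zone->managed_pages);
}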
Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Michal Hocko <mhocko@suse.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:13:37 +0000 (11:13 +1100)]
mm: reference totalram_pages and managed_pages once per function
Patch series "mm: convert totalram_pages, totalhigh_pages and managed
pages to atomic", v5.
This series converts totalram_pages, totalhigh_pages and
zone->managed_pages to atomic variables.
totalram_pages, zone->managed_pages and totalhigh_pages updates are
protected by managed_page_count_lock, but readers never care about it.
Convert these variables to atomic to avoid readers potentially seeing a
store tear.
The main motivation was that managed_page_count_lock handling was
complicating things. It was discussed at length here,
https://lore.kernel.org/patchwork/patch/995739/#1181785 It seems better
to remove the lock and convert the variables to atomic. With the change,
preventing potential store-to-read tearing comes as a bonus.
This patch (of 4):
This is in preparation for a later patch which converts totalram_pages
and zone->managed_pages to atomic variables. Please note that re-reading
the value might lead to a different value and as such it could lead to
unexpected behavior. There are no known bugs as a result of the current
code but it is better to prevent them in principle.
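The pattern is simply to load the global once into a local and use that
snapshot throughout the function; a hedged sketch (the function and
threshold are hypothetical):

/* hypothetical caller: both comparisons see the same snapshot even if
 * totalram_pages is updated concurrently */
static bool low_on_memory(unsigned long min_pages)
{
	unsigned long pages = totalram_pages;	/* referenced once */

	return pages < min_pages || pages / 2 < min_pages;
}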
Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Kirill Tkhai [Wed, 5 Dec 2018 00:13:37 +0000 (11:13 +1100)]
mm/ksm.c: assist buddy allocator to assemble 1-order pages
try_to_merge_two_pages() merges two pages: one of them is a page of the
currently scanned mm, the second is a page with an identical hash from
the unstable tree. Currently, we merge the page from the unstable tree
into the first one and then free it.
The idea of this patch is to prefer freeing whichever of the two pages
has a free neighbour (i.e., a neighbour with zero page_count()). This
allows the buddy allocator to assemble at least a 1-order set from the
freed page and its neighbour; this is a kind of cheap passive compaction.
AFAIK, a 1-order page set consists of pages with PFNs [2n, 2n+1] (even,
odd), so the neighbour's pfn is calculated via XOR with 1. We check that
the resulting pfn is valid, check its page_count(), and prefer merging
into @tree_page if the neighbour's usage count is zero.
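In code, the neighbour check described above amounts to roughly the
following (the helper name is made up for illustration):

static bool has_free_buddy(struct page *page)
{
	unsigned long buddy_pfn = page_to_pfn(page) ^ 1;

	/* a valid neighbour with zero usage count means freeing this
	 * page lets the buddy allocator assemble an order-1 block */
	return pfn_valid(buddy_pfn) &&
	       page_count(pfn_to_page(buddy_pfn)) == 0;
}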
There is a small difference from the current behavior in the error path.
In case the second try_to_merge_with_ksm_page() fails, we return from
try_to_merge_two_pages() with @tree_page removed from the unstable tree.
It does not seem to matter, but if we do not want a change at all, it's
not a problem to move remove_rmap_item_from_tree() from
try_to_merge_with_ksm_page() to its callers.
Link: http://lkml.kernel.org/r/153995241537.4096.15189862239521235797.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jia He <jia.he@hxt-semitech.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:37 +0000 (11:13 +1100)]
mm: remove reset of pcp->counter in pageset_init()
per_cpu_pageset is cleared by memset(), so it is not necessary to reset
it again.
Link: http://lkml.kernel.org/r/20181021023920.5501-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm-add-an-f_seal_future_write-seal-to-memfd-fix-2
v4
Link: http://lkml.kernel.org/r/20181122230906.GA198127@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust
A better way to implement the F_SEAL_FUTURE_WRITE seal was discussed [1]
last week, one where we don't need to modify core VFS structures to get
the same behavior of the seal. This solves several side-effects pointed
out by Andy [2].
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward to
migrating all usecases of ashmem to memfd so that we can possibly remove
the ashmem driver in the future from staging while also benefiting from
using memfd and contributing to it. Note that staging drivers are also
not ABI and generally can be removed at any time.
One of the main usecases Android has is the ability to create a region and
mmap it as writeable, then add protection against making any "future"
writes while keeping the existing already mmap'ed writeable-region active.
This allows us to implement a usecase where receivers of the shared
memory buffer can get a read-only view, while the sender continues to
write to the buffer. See CursorWindow documentation in Android for more
details:
This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active.
A better way to implement the F_SEAL_FUTURE_WRITE seal was discussed [1]
last week, one where we don't need to modify core VFS structures to get
the same behavior of the seal. This solves several side-effects pointed
out by Andy.
Self-tests are provided in a later patch to verify the expected semantics.
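A hedged userspace sketch of the intended flow (assumes a kernel with
this seal; 'size' is assumed to be defined and error handling is omitted):

int fd = memfd_create("cursor-window", MFD_ALLOW_SEALING);
ftruncate(fd, size);

/* the sender keeps a writable mapping... */
char *wr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* ...then forbids any *future* writes through the fd */
fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE);

/* from here on a PROT_WRITE/MAP_SHARED mmap or a write() on fd fails,
 * while 'wr' stays writable; receivers map read-only */
char *rd = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);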
Michal Hocko [Wed, 5 Dec 2018 00:13:36 +0000 (11:13 +1100)]
mm, memory_hotplug: do not clear numa_node association after hot_remove
Per-cpu numa_node provides a default node for each possible cpu. The
association gets initialized during the boot when the architecture
specific code explores cpu->NUMA affinity. When the whole NUMA node is
removed, though, we clear this association.
This means that whoever calls cpu_to_node for a cpu associated with such
a node will get NUMA_NO_NODE. This is problematic for two reasons.
First, it is fragile because __alloc_pages_node would simply blow up on
an out-of-bounds access. We have encountered this when loading the kvm
module on an older kernel, but the code is basically the same in the
current Linus tree as well. alloc_vmcs_cpu could use
alloc_pages_nodemask, which would recognize NUMA_NO_NODE and use
alloc_pages_node, which would translate it to numa_mem_id, but that is
wrong as well because it would use the cpu affinity of the local CPU,
which might be quite far from the original node. It is also reasonable
to expect that cpu_to_node will provide a sane value, and there might be
many more callers like that.
The second problem is that __register_one_node relies on cpu_to_node to
properly associate cpus back to the node when it is onlined. We do not
want to lose that link as there is no arch independent way to get it from
the early boot time AFAICS.
Drop the whole check_and_unmap_cpu_on_node machinery and keep the
association to fix both issues. The NODE_DATA(nid) is not deallocated so
it will stay in place and if anybody wants to allocate from that node then
a fallback node will be used.
Thanks to Vlastimil Babka, whose live system debugging skills helped
debug the issue.
Link: http://lkml.kernel.org/r/20181108100413.966-1-mhocko@kernel.org Fixes: e13fe8695c57 ("cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node") Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Miroslav Benes <mbenes@suse.cz> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yangtao Li [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm/mmap.c: remove verify_mm_writelocked()
We should get rid of this function. It no longer serves its purpose.
This is a historical artifact from 2005 when do_brk was called outside of
the core mm. We have a proper abstraction in vm_brk_flags, and that one
does the locking properly, so there is no need to use this function.
Link: http://lkml.kernel.org/r/20181108174856.10811-1-tiny.windzz@gmail.com Signed-off-by: Yangtao Li <tiny.windzz@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm: select HAVE_MOVE_PMD on x86 for faster mremap
Moving page-tables at the PMD-level on x86 is known to be safe. Enable
this option so that we can do fast mremap when possible.
Link: http://lkml.kernel.org/r/20181108181201.88826-4-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Michal Hocko <mhocko@kernel.org> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm/mremap: fix 'move_normal_pmd' unused function warning
The move_normal_pmd function may not be used on architectures that don't
enable HAVE_MOVE_PMD. This has shown to cause unused function warnings on
those architectures. Lets not define it for those cases.
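The shape of the fix is a config guard around the helper; a sketch with
the function body elided:

#ifdef CONFIG_HAVE_MOVE_PMD
static bool move_normal_pmd(struct vm_area_struct *vma,
		unsigned long old_addr, unsigned long new_addr,
		unsigned long old_end, pmd_t *old_pmd, pmd_t *new_pmd)
{
	/* ... the PMD-level move from the previous patch ... */
}
#endif /* only compiled when the architecture opts in */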
Link: http://lkml.kernel.org/r/20181108224457.GB209347@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reported-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm: speed up mremap by 20x on large regions
Android needs to mremap large regions of memory during memory-management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which copies one pte at
a time and can be really slow across a large map. Turning on THP may not
be a viable option, and is not for us. This patch speeds up the
performance for non-THP systems by copying at the PMD level when possible.
The speedup is an order of magnitude on x86 (~20x). On a 1GB mremap, the
mremap completion time drops from 3.4-3.6 milliseconds to 144-160
microseconds.
Before:
Total mremap time for 1GB data: 3521942 nanoseconds.
Total mremap time for 1GB data: 3449229 nanoseconds.
Total mremap time for 1GB data: 3488230 nanoseconds.
After:
Total mremap time for 1GB data: 150279 nanoseconds.
Total mremap time for 1GB data: 144665 nanoseconds.
Total mremap time for 1GB data: 158708 nanoseconds.
If THP is enabled the optimization is mostly skipped except in certain
situations.
Link: http://lkml.kernel.org/r/20181108181201.88826-3-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joel Fernandes (Google) [Wed, 5 Dec 2018 00:13:35 +0000 (11:13 +1100)]
mm: treewide: remove unused address argument from pte_alloc functions
Patch series "Add support for fast mremap"/
This series speeds up the mremap(2) syscall by copying page tables at the
PMD level even for non-THP systems. There is a concern that the extra
'address' argument that mremap passes to pte_alloc may do something
subtle and architecture-related in the future that may make the scheme
not work.
Also, we find that there is no point in passing the 'address' to
pte_alloc since it is unused. This patch therefore removes this argument
tree-wide, resulting in a nice negative diff as well. We also ensure
along the way that the enabled architectures do not do anything funky
with the 'address' argument that would go unnoticed by the optimization.
Build and boot tested on x86-64. Build tested on arm64. The config
enablement patch for arm64 will be posted in the future after more
testing.
The changes were obtained by applying the following Coccinelle script.
(Thanks, Julia, for answering all the Coccinelle questions!)
The following fix-ups were done manually:
* Removal of the address argument from pte_fragment_alloc
* Removal of the pte_alloc_one_fast definitions from m68k and microblaze.
// Options: --include-headers --no-includes
// Note: I split the 'identifier fn' line, so if you are manually
// running it, please unsplit it so it runs for you.
virtual patch
@pte_alloc_func_def depends on patch exists@
identifier E2;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
type T2;
@@
fn(...
- , T2 E2
)
{ ... }
@pte_alloc_func_proto_noarg depends on patch exists@
type T1, T2, T3, T4;
identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1, T2);
+ T3 fn(T1);
|
- T3 fn(T1, T2, T4);
+ T3 fn(T1, T2);
)
@pte_alloc_func_proto depends on patch exists@
identifier E1, E2, E4;
type T1, T2, T3, T4;
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
@@
(
- T3 fn(T1 E1, T2 E2);
+ T3 fn(T1 E1);
|
- T3 fn(T1 E1, T2 E2, T4 E4);
+ T3 fn(T1 E1, T2 E2);
)
@pte_alloc_macro depends on patch exists@
identifier fn =~
"^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
identifier a, b, c;
expression e;
position p;
@@
(
- #define fn(a, b, c) e
+ #define fn(a, b) e
|
- #define fn(a, b) e
+ #define fn(a) e
)
Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michal Hocko <mhocko@kernel.org> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
As jhash2 will always be slower (for data sizes like PAGE_SIZE), don't
use it in ksm at all.
Use only xxhash for now, because for using crc32c, cryptoapi must be
initialized first - that requires some tricky solution to work well in all
situations.
Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Signed-off-by: leesioh <solee@os.korea.ac.kr> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Experiment process:
First, we turn off KSM and launch 4 VMs. Then we turn on KSM and
measure the checksum computation time until full_scans becomes two.
The experimental results (each value is the average of the measured
values):
crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns
In summary, the result shows that crc32c_intel has advantages over all
of the hash functions used in the experiment (computation time decreased
by 84.54% compared to crc32c, 78.86% compared to jhash2, 51.33% compared
to xxhash32, and 23.28% compared to xxhash64). The results are similar
to those of Timofey.
But use only xxhash for now, because to use crc32c, the cryptoapi must be
initialized first - that requires some tricky solution to work well in
all situations.
So:
- The first patch implements compile-time pickup of the fastest
  implementation of xxhash for the target platform.
- The second patch replaces jhash2 with xxhash.
This patch (of 2):
xxh32() - fast on both 32-bit and 64-bit platforms
xxh64() - fast only on 64-bit platforms
Create xxhash(), which will pick the fastest version at compile time.
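The resulting helper is tiny; roughly:

static inline unsigned long xxhash(const void *input, size_t length,
				   uint64_t seed)
{
#if BITS_PER_LONG == 64
	return xxh64(input, length, seed);	/* fast on 64-bit */
#else
	return xxh32(input, length, seed);	/* xxh64 would be slow here */
#endif
}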
Link: http://lkml.kernel.org/r/20181023182554.23464-2-nefelim4ag@gmail.com Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: leesioh <solee@os.korea.ac.kr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm, memory_hotplug: be more verbose for memory offline failures
There is only very limited information printed when the memory offlining
fails:
[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
This tells us that the failure is triggered by userspace intervention,
but it doesn't tell us much more about the underlying reason. It might
be that the page migration fails repeatedly and the userspace timeout
expires and sends a signal, or it might be that some of the earlier steps
(isolation, memory notifier) take too long.
If the migration fails then it would be really helpful to see which page
that is and what its state is. The same applies to the isolation phase.
If we fail to isolate a page from the allocator then knowing the state of
the page would be helpful as well.
Dump the page state that fails to get isolated or migrated. This will
tell us more about the failure and what to focus on during debugging.
Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Baoquan He <bhe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm, memory_hotplug: print reason for the offlining failure
The memory offlining failure reporting is inconsistent and insufficient.
Some error paths simply do not report the failure to the log at all. When
we do report, there are no details about the reason for the failure, and
there are several possible reasons, which makes memory offlining failures
hard to debug.
Make sure that the
memory offlining [mem %#010llx-%#010llx] failed
message is printed for all failures and also provide a short textual
reason for the failure e.g.
[ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
This tells us that the offlining failed because of a pending signal,
aka user intervention.
Link: http://lkml.kernel.org/r/20181107101830.17405-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm, memory_hotplug: drop pointless block alignment checks from __offline_pages
This function is never called from a context which would provide a
misaligned pfn range, so drop the pointless check.
Link: http://lkml.kernel.org/r/20181107101830.17405-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:33 +0000 (11:13 +1100)]
mm: lower the printk loglevel for __dump_page messages
__dump_page messages use the KERN_EMERG or KERN_ALERT loglevel (this has
been the case since 2004). Most callers of this function are really
detecting a critical page state and BUG right after. On the other hand,
the function is also called from contexts which just want to inform about
the page state and those would rather not disrupt logs that much (e.g.
some systems route these messages to the normal console).
Reduce the loglevel to KERN_WARNING to make dump_page easier to reuse for
other contexts while those messages will still make it to the kernel log
in most setups. Even if the loglevel setup filters warnings away, those
paths that are really critical already print a more targeted error or
panic, and that should make it to the kernel log.
Link: http://lkml.kernel.org/r/20181107101830.17405-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20181125080834.GB12455@dhcp22.suse.cz Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dan Carpenter [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: debug: Fix a width vs precision bug in printk
We had intended to only print dentry->d_name.len characters but there is
a width vs precision typo so if the name isn't NUL terminated it will
read past the end of the buffer.
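To make the distinction concrete, a sketch (the format strings are
illustrative, not the exact line in __dump_page):

/* width: pads to at least len characters but keeps reading up to the
 * NUL, so an unterminated name overruns the buffer */
pr_warn("name:\"%*s\"\n", dentry->d_name.len, dentry->d_name.name);

/* precision: prints at most len characters -- what was intended */
pr_warn("name:\"%.*s\"\n", dentry->d_name.len, dentry->d_name.name);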
Link: http://lkml.kernel.org/r/20181123072135.gqvblm2vdujbvfjs@kili.mountain Fixes: 408ddbc22be3 ("mm: print more information about mapping in __dump_page") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Michal Hocko [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: print more information about mapping in __dump_page
I have been promising to improve memory offlining failure debugging for
quite some time. As things stand now we get only very limited
information in the kernel log when the offlining fails. It is usually
only the failure message itself, with no further details. We do not know
what exactly fails and for what reason. Whenever I was forced to debug
such a failure I've always had to write a debugging patch to tell me
more. We can enable some tracepoints, but it would be much better to get
a better picture without using them.
This patch series does two things. The first is to make dump_page more
usable by printing more information about the mapping (patch 1). Then it
reduces the log level from emerg to warning so that this function is
usable from less critical contexts (patch 2). Then I have added more
detailed information about the offlining failure (patch 4) and finally
added dump_page to the isolation and offlining migration paths (patch 5).
Patch 3 is a trivial cleanup.
This patch (of 5):
__dump_page prints the mapping pointer, but that is quite unhelpful for
many reports because the pointer itself only helps to distinguish
anon/ksm mappings from other ones (because of the lowest bits set).
Sometimes it would be much more helpful to know what kind of mapping it
actually is, and if we know it is a file mapping, to also try to resolve
the dentry name.
Link: http://lkml.kernel.org/r/20181107101830.17405-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Oscar Salvador <OSalvador@suse.com> Cc: Baoquan He <bhe@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Wed, 5 Dec 2018 00:13:32 +0000 (11:13 +1100)]
mm: ksm: do not block on page lock when searching stable tree
ksmd needs to search the stable tree to look for a suitable KSM page, but
the KSM page might be locked for a long time due to the KSM page rmap
walk.
It is not worth waiting for the lock; the page can be skipped and we can
then try to merge it in the next scan to avoid long stalls if its content
is still intact.
Introduce an async mode to get_ksm_page() to not block on the page lock,
as try_to_merge_one_page() does.
Return -EBUSY if the trylock fails, since a NULL means failure to find a
suitable KSM page, which is a valid case.
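A hedged sketch of the async path inside get_ksm_page() (the flag name
is illustrative):

if (async) {
	if (!trylock_page(page)) {
		put_page(page);
		/* -EBUSY, not NULL: the page exists but is contended,
		 * so the caller can retry it on the next scan */
		return ERR_PTR(-EBUSY);
	}
} else {
	lock_page(page);
}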
Link: http://lkml.kernel.org/r/1541618201-120667-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
ksmd found a suitable KSM page on the stable tree and is trying to lock
it, but it is locked by the direct reclaim path, which is walking the
page's rmap to get the number of referenced PTEs.
The KSM page rmap walk needs to iterate over all rmap_items of the page
and all rmap anon_vmas of each rmap_item. So it may take (# rmap_items *
# child processes) loops. This number of loops might be very large in
the worst case, and may take a long time.
Typically, direct reclaim will not intend to reclaim too many pages, and
it is latency sensitive. So it is not worth doing the long ksm page rmap
walk to reclaim just one page.
Skip KSM pages in direct reclaim if the reclaim priority is low, but still
try to reclaim KSM pages with high priority.
Link: http://lkml.kernel.org/r/1541618201-120667-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anders Roxell [Wed, 5 Dec 2018 00:13:31 +0000 (11:13 +1100)]
writeback: don't decrement wb->refcnt if !wb->bdi
This happened while running the AMBA PL011 UART driver in
qemu-system-aarch64 with CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled. Since
arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init),
devtmpfs' handle_remove() crashes: the reference count is a NULL pointer
only because wb->bdi hasn't been initialized yet.
Rework so that wb_put has an extra check for wb->bdi before decrementing
wb->refcnt, and also add a WARN_ON_ONCE to get a warning if it happens
again in other drivers.
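A sketch of the reworked helper, assuming wb->refcnt is the percpu
refcount it is upstream:

static inline void wb_put(struct bdi_writeback *wb)
{
	/* a driver bug can remove a file before its bdi is set up */
	if (WARN_ON_ONCE(!wb->bdi))
		return;

	percpu_ref_put(&wb->refcnt);
}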
Link: http://lkml.kernel.org/r/20181030113545.30999-2-anders.roxell@linaro.org Fixes: 52ebea749aae ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Co-developed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Contrary to its name, mmu_notifier_synchronize() does not synchronize
the notifier's SRCU instance, but rather waits for RCU callbacks to
finish, i.e. it invokes rcu_barrier(). The RCU documentation is quite
clear on this matter, explicitly calling out that rcu_barrier() does not
imply synchronize_rcu().
As there are no callers of mmu_notifier_synchronize() and it's unclear
whether any user of mmu_notifier_call_srcu() will ever want to barrier on
their callbacks, simply remove the function.
Balbir Singh [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm/hotplug: optimize clear_hwpoisoned_pages()
In hot remove, we try to clear poisoned pages. A small optimization,
checking whether num_poisoned_pages is 0, lets us skip the iteration
through nr_pages entirely.
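The optimization is a single early return; sketched (num_poisoned_pages
is the global hwpoison counter):

static void clear_hwpoisoned_pages(struct page *memmap, int nr_pages)
{
	int i;

	/* nothing is poisoned system-wide: skip the per-page scan */
	if (!atomic_long_read(&num_poisoned_pages))
		return;

	for (i = 0; i < nr_pages; i++) {
		/* ... existing per-page hwpoison clearing ... */
	}
}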
Link: http://lkml.kernel.org/r/20181102120001.4526-1-bsingharora@gmail.com Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm-page_owner-clamp-read-count-to-page_size-fix
use min_t()
Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miles Chen [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
mm/page_owner: clamp read count to PAGE_SIZE
The (root-only) page owner read might allocate a large amount of memory
with a large read count. Allocation failures can easily occur when doing
high-order allocations.
Clamp the buffer size to PAGE_SIZE to avoid arbitrary-size allocations
and allocation failures due to high-order allocations.
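With the min_t() fix from the patch above folded in, the read path
amounts to roughly:

/* never allocate more than a page, whatever count userspace passed */
count = min_t(size_t, count, PAGE_SIZE);
kbuf = kmalloc(count, GFP_KERNEL);
if (!kbuf)
	return -ENOMEM;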
Link: http://lkml.kernel.org/r/1541091607-27402-1-git-send-email-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Wed, 5 Dec 2018 00:13:30 +0000 (11:13 +1100)]
include/linux/slab.h: fix sparse warning in kmalloc_type()
Multiple people have reported the following sparse warning:
./include/linux/slab.h:332:43: warning: dubious: x & !y
The minimal fix would be to change the logical & to boolean &&, which
emits the same code, but Andrew has suggested that the branch-avoiding
tricks are maybe not worthwhile. David Laight provided a nice comparison
of the disassembly of multiple variants, which shows that the current
version produces a 4-deep dependency chain, and fixing the sparse warning
by changing the logical and to a multiplication emits an IMUL, making it
even more expensive.
The code as rewritten by this patch yielded the best disassembly, with a
single predictable branch for the most common case, and a ternary operator
for the rest, which gcc seems to compile without a branch or cmov by
itself.
The result should be more readable, without a sparse warning and probably
also faster for the common case.
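The rewritten helper looks roughly like this (an approximation based on
the description above, not a verbatim copy of the patch):

static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
{
	/* the single predictable branch: neither special flag set */
	if (likely((flags & (__GFP_DMA | __GFP_RECLAIMABLE)) == 0))
		return KMALLOC_NORMAL;

	/* the rare cases: a ternary gcc can compile branchlessly */
	return flags & __GFP_DMA ? KMALLOC_DMA : KMALLOC_RECLAIM;
}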
Link: http://lkml.kernel.org/r/80340595-d7c5-97b9-4f6c-23fa893a91e9@suse.cz Fixes: 1291523f2c1d ("mm, slab/slub: introduce kmalloc-reclaimable caches") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Darryl T. Agostinelli <dagostinelli@gmail.com> Reported-by: Masahiro Yamada <yamada.masahiro@socionext.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: improve performance by skipping checked node in get_any_partial()
1. Background
Current slub has three layers:
* cpu_slab
* percpu_partial
* per node partial list
The slub allocator tries to get an object from top to bottom. When it
can't get an object from the upper two layers, it searches the per-node
partial list. This is done in get_partial().
The idea behind this is: first try the local node, then try other nodes
if the caller doesn't specify a node.
2. Room for Improvement
When we look one step deeper in get_any_partial(), it tries to get a
proper node by for_each_zone_zonelist(), which iterates on the
node_zonelists.
This behavior introduces some redundant checks on the same node,
because:
* the local node is already checked in get_partial_node()
* one node may have several zones on node_zonelists
3. Solution Proposed in Patch
We could reduce these redundant checks by recording the last
unsuccessful node and then skipping it, as sketched below.
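A hedged sketch of the loop with the skip (the 'except' variable name is
illustrative; the real iteration also performs cpuset and defrag checks):

int except = NUMA_NO_NODE;	/* last node that failed */

for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
	int nid = zone_to_nid(zone);

	if (nid == except)
		continue;	/* this node was just checked */

	object = get_partial_node(s, get_node(s, nid), c, flags);
	if (object)
		return object;

	except = nid;		/* remember the unsuccessful node */
}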
4. Tests & Result
After some tests, the result shows this may improve the system a little,
especially on a machine with only one node.
4.1 Test Description
There are two cases for two system configurations.
Test Cases:
1. counter comparison
2. kernel build test
System Configuration:
1. One node machine with 4G
2. Four node machine with 8G
4.2 Result for Test 1
Test 1: counter comparison
This is a test with a hacked kernel to record the number of times
get_any_partial() is invoked and the number of times the inner loop
iterates. By comparing the ratio of the two counters, we get to know
how many inner loops we skipped.
Here is a snip of the test patch.
---
static void *get_any_partial() {
get_partial_count++;
do {
for_each_zone_zonelist() {
get_partial_try_count++;
}
} while();
return NULL;
}
---
The result of (get_partial_count / get_partial_try_count):
Link: http://lkml.kernel.org/r/20181120033119.30013-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: record final state of slub action in deactivate_slab()
If __cmpxchg_double_slab() fails and (l != m), the current code records
transition states of the slub action.
Update the action after __cmpxchg_double_slab() succeeds to record the
final state.
Link: http://lkml.kernel.org/r/20181107013119.3816-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:29 +0000 (11:13 +1100)]
mm/slub.c: page is always non-NULL in node_match()
node_match() is a static function and is only invoked in slub.c.
In all three call sites, `page' is ensured to be valid.
Link: http://lkml.kernel.org/r/20181106150245.1668-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:28 +0000 (11:13 +1100)]
mm/slub.c: remove validation on cpu_slab in __flush_cpu_slab()
cpu_slab is a per-cpu variable that is allocated for all CPUs or none.
If cpu_slab fails to be allocated, slub is not usable.
We can therefore use cpu_slab without validation in __flush_cpu_slab().
Link: http://lkml.kernel.org/r/20181103141218.22844-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Josh Hunt [Wed, 5 Dec 2018 00:13:28 +0000 (11:13 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found that with newer kernels the cdrom device started showing up in
/proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduced this change in behavior. It's not
clear to me from the commit's changelog whether this change was
intentional or not. This comment still remains: /* Don't show
non-partitionable removeable devices or empty devices */ so I've decided
to send a patch to restore the behavior of not printing unpartitionable
removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
#42: FILE: fs/ocfs2/aops.c:2155:
+ unsigned i_blkbits = inode->i_sb->s_blocksize_bits;
ERROR: code indent should use tabs where possible
#53: FILE: fs/ocfs2/aops.c:2166:
+ ^I * "pos" and "end", we need map twice to return different buffer state:$
WARNING: please, no space before tabs
#53: FILE: fs/ocfs2/aops.c:2166:
+ ^I * "pos" and "end", we need map twice to return different buffer state:$
ERROR: code indent should use tabs where possible
#54: FILE: fs/ocfs2/aops.c:2167:
+ ^I * 1. area in file size, not set NEW;$
WARNING: please, no space before tabs
#54: FILE: fs/ocfs2/aops.c:2167:
+ ^I * 1. area in file size, not set NEW;$
ERROR: code indent should use tabs where possible
#55: FILE: fs/ocfs2/aops.c:2168:
+ ^I * 2. area out file size, set NEW.$
WARNING: please, no space before tabs
#55: FILE: fs/ocfs2/aops.c:2168:
+ ^I * 2. area out file size, set NEW.$
ERROR: code indent should use tabs where possible
#56: FILE: fs/ocfs2/aops.c:2169:
+ ^I *$
WARNING: please, no space before tabs
#56: FILE: fs/ocfs2/aops.c:2169:
+ ^I *$
ERROR: code indent should use tabs where possible
#57: FILE: fs/ocfs2/aops.c:2170:
+ ^I *^I^I iblock endblk$
WARNING: please, no space before tabs
#57: FILE: fs/ocfs2/aops.c:2170:
+ ^I *^I^I iblock endblk$
ERROR: code indent should use tabs where possible
#58: FILE: fs/ocfs2/aops.c:2171:
+ ^I * |--------|---------|---------|---------$
WARNING: please, no space before tabs
#58: FILE: fs/ocfs2/aops.c:2171:
+ ^I * |--------|---------|---------|---------$
ERROR: code indent should use tabs where possible
#59: FILE: fs/ocfs2/aops.c:2172:
+ ^I * |<-------area in file------->|$
WARNING: please, no space before tabs
#59: FILE: fs/ocfs2/aops.c:2172:
+ ^I * |<-------area in file------->|$
ERROR: code indent should use tabs where possible
#60: FILE: fs/ocfs2/aops.c:2173:
+ ^I */$
WARNING: please, no space before tabs
#60: FILE: fs/ocfs2/aops.c:2173:
+ ^I */$
total: 8 errors, 9 warnings, 40 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
NOTE: Whitespace errors detected.
You may wish to use scripts/cleanpatch or scripts/cleanfile
./patches/ocfs2-clear-zero-in-unaligned-direct-io.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Jia Guo <guojia12@huawei.com> Cc: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jia Guo [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: clear zero in unaligned direct IO
The unused portion of a part-written fs-block-sized block is not set to
zero in an unaligned append direct write. This can lead to serious data
inconsistencies.
Ocfs2 manages the disk with a cluster size (for example, 1M); a partial
write in one cluster changes the cluster state from UNWRITTEN to WRITTEN,
but the VFS (dio_zero_block()) doesn't do the zeroing because the bh's
state is not set to NEW in ocfs2_dio_wr_get_block() when we write to a
WRITTEN cluster. For example, if the cluster size is 1M, the file size
is 8k and we direct-write from 14k to 15k, then 12k~14k and 15k~16k will
contain dirty data.
We have to deal with two cases:
1. The starting position of the direct write is outside the file.
2. The starting position of the direct write is located within the file.
We need to set the bh's state to NEW in the first case. In the second
case, we need to map twice, because the bh's state for the area outside
the file size should be set to NEW while that for the area within the
file should not.
Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com Signed-off-by: Jia Guo <guojia12@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: don't clear bh uptodate for block read
For a sync IO read in ocfs2_read_blocks_sync(), we first clear the bh
uptodate flag and submit the IO, then wait for the IO to finish, and
finally check whether the bh is uptodate; if not, we return an IO error.
If two sync IOs for the same bh were issued, the first IO could finish
and set the uptodate flag, but just before that flag is checked, the
second IO could come in and clear it; then ocfs2_read_blocks_sync() for
the first IO will return an IO error.
Indeed it is not necessary to clear the uptodate flag, as the IO end
handler end_buffer_read_sync() will set or clear it based on whether the
IO succeeded or failed.
The following messages were found on an nfs server, but the underlying
storage returned no error.
[4106438.567376] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2780 ERROR: read block 1238823695 failed -5
[4106438.567569] (nfsd,7146,3):ocfs2_get_suballoc_slot_bit:2812 ERROR: status = -5
[4106438.567611] (nfsd,7146,3):ocfs2_test_inode_bit:2894 ERROR: get alloc slot and bit failed -5
[4106438.567643] (nfsd,7146,3):ocfs2_test_inode_bit:2932 ERROR: status = -5
[4106438.567675] (nfsd,7146,3):ocfs2_get_dentry:94 ERROR: test inode bit failed -5
The same issue exists in the non-sync read path ocfs2_read_blocks();
fix it as well.
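A sketch of the corrected submit sequence (illustrative; the key point
is that there is no clear_buffer_uptodate() before submission):

get_bh(bh);			/* extra ref, dropped by the end_io */
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, 0, bh);
wait_on_buffer(bh);
if (!buffer_uptodate(bh))	/* set/cleared from the IO result */
	status = -EIO;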
Link: http://lkml.kernel.org/r/20181121020023.3034-4-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: clear journal dirty flag after shutdown journal
The dirty flag of the journal should be cleared at the last stage of
umount. If it is cleared before jbd2_journal_destroy(), some metadata in
an uncommitted transaction could be lost due to an IO error, but since
the dirty flag of the journal was already cleared, we can't detect that
until a full fsck is run. This may cause a system panic or other
corruption.
Link: http://lkml.kernel.org/r/20181121020023.3034-3-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Changwei Ge <ge.changwei@h3c.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@versity.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Junxiao Bi [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: fix panic due to unrecovered local alloc
mount.ocfs2 ignores the inconsistency where the journal is clean but the
local alloc is unrecovered. After mount, the local alloc is not empty,
so the cluster reservation code doesn't allocate a new local alloc
window; the reservation map is empty
(ocfs2_reservation_map.m_bitmap_len = 0), which triggered the following
panic.
This issue was reported at
https://oss.oracle.com/pipermail/ocfs2-devel/2015-May/010854.html and the
advice was to fix it during mount. But this is a very unusual
inconsistent state; usually the journal dirty flag is cleared at the last
stage of umount, only after everything else has gone right. We may need
further debugging to check that. Anyway, to avoid possible further
corruption, the mount should be aborted and fsck should be run.
Link: http://lkml.kernel.org/r/20181121020023.3034-2-junxiao.bi@oracle.com Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Larry Chen [Wed, 5 Dec 2018 00:13:27 +0000 (11:13 +1100)]
ocfs2: improve ocfs2 Makefile
The include file path was hard-wired in the ocfs2 Makefile, which might
cause some confusion when compiling ocfs2 as an external module.
Say we compile the ocfs2 module as follows:
cp -r /kernel/tree/fs/ocfs2 /other/dir/ocfs2
cd /other/dir/ocfs2
make -C /path/to/kernel_source M=`pwd` modules
Actually, the compiler will try to find the included files in
/kernel/tree/fs/ocfs2, rather than in the directory /other/dir/ocfs2.
To fix this little bug, we introduce the variable $(src) provided by
kbuild. $(src) means the absolute path of the running kbuild file.
Link: http://lkml.kernel.org/r/20181108085546.15149-1-lchen@suse.com Signed-off-by: Larry Chen <lchen@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhong jiang [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: remove set but not used variable 'lastzero'
lastzero is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540296942-24533-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhong jiang [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: dlmfs: remove set but not used variable 'status'
status is not used after setting its value. It is safe to remove the
unused variable.
Link: http://lkml.kernel.org/r/1540300179-26697-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jia Guo [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
ocfs2: optimize the reading of heartbeat data
Reading heartbeat data from the lowest node rather than from zero, in
cases where the node numbers do not start from zero, can reduce the
number of sectors read.
Here is simple test data obtained with 'iostat -dmx dm-5 2', with two
nodes in the cluster, node numbers 10 and 20, respectively.
Qian Cai [Wed, 5 Dec 2018 00:13:26 +0000 (11:13 +1100)]
debugobjects: call debug_objects_mem_init earlier
The current value of the early boot static pool size, 1024, is not big
enough for systems with a large number of CPUs with timer and/or
workqueue objects selected. As a result, systems with 60+ CPUs and both
timer and workqueue objects enabled could trigger "ODEBUG: Out of memory.
ODEBUG disabled".
Some debug objects are allocated during early boot. Enabling some
options like timers or workqueue objects may increase the size required
significantly with a large number of CPUs. For example,
plus No. CPUs x 6 (workqueue) objects:
workqueue_init_early
alloc_workqueue
__alloc_workqueue_key
alloc_and_link_pwqs
init_pwq
Also, plus No. CPUs objects:
perf_event_init
__init_srcu_struct
init_srcu_struct_fields
init_srcu_struct_nodes
__init_work
However, none of these things are actually used or required before
debug_objects_mem_init() is invoked, so just move the call right before
vmalloc_init().
According to tglx, "the reason why the call is at this place in
start_kernel() is historical. It's because back in the days when
debugobjects were added the memory allocator was enabled way later than
today."
Link: http://lkml.kernel.org/r/20181126102407.1836-1-cai@gmx.us Signed-off-by: Qian Cai <cai@gmx.us> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
arch/sh/boards/mach-kfr2r09/setup.c does not need to #include
<mtd/onenand.h>, and doing so causes a build warning, so drop that header
file.
In file included from ../arch/sh/boards/mach-kfr2r09/setup.c:28:
../include/linux/mtd/onenand.h:225:12: warning: 'struct mtd_oob_ops' declared inside parameter list will not be visible outside of this definition or declaration
struct mtd_oob_ops *ops);
Link: http://lkml.kernel.org/r/702f0a25-c63e-6912-4640-6ab0f00afbc7@infradead.org Fixes: f3590dc32974 ("media: arch: sh: kfr2r09: Use new renesas-ceu camera driver") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Suggested-by: Miquel Raynal <miquel.raynal@bootlin.com> Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Jacopo Mondi <jacopo+renesas@jmondi.org> Cc: Magnus Damm <magnus.damm@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:24 +0000 (11:13 +1100)]
kasan, mm, arm64: tag non slab memory allocated via pagealloc
Tag-based KASAN doesn't check memory accesses through pointers tagged with
0xff. When page_address is used to get a pointer to memory that corresponds
to some page, the tag of the resulting pointer gets set to 0xff, even
though the allocated memory might have been tagged differently.
For slab pages it's impossible to recover the correct tag to return from
page_address, since the page might contain multiple slab objects tagged
with different values, and we can't know in advance which one of them is
going to get accessed. For non-slab pages, however, we can recover the tag
in page_address, since the whole page was marked with the same tag.
This patch adds tagging to non-slab memory allocated with pagealloc. To
set the tag of the pointer returned from page_address, the tag gets stored
in page->flags when the memory gets allocated.
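For illustration, a minimal sketch of keeping the tag in page->flags,
assuming an 8-bit tag field reserved at KASAN_TAG_PGSHIFT by the page
flags layout (the exact layout is config dependent):
  #define KASAN_TAG_WIDTH 8
  #define KASAN_TAG_MASK  ((1UL << KASAN_TAG_WIDTH) - 1)
  static inline u8 page_kasan_tag(const struct page *page)
  {
          return (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK;
  }
  static inline void page_kasan_tag_set(struct page *page, u8 tag)
  {
          page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT);
          page->flags |= (unsigned long)tag << KASAN_TAG_PGSHIFT;
  }
page_address() can then apply page_kasan_tag(page) to the top byte of the
pointer it returns for non-slab pages.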
Link: http://lkml.kernel.org/r/6c1004acf28880f6a5cc7d2f974ba08adb2853ea.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:24 +0000 (11:13 +1100)]
kasan, arm64: add brk handler for inline instrumentation
Tag-based KASAN inline instrumentation mode (which embeds checks of shadow
memory into the generated code, instead of inserting a callback) generates
a brk instruction when a tag mismatch is detected.
This commit adds a tag-based KASAN specific brk handler that decodes the
immediate value passed to the brk instruction (to extract information
about the memory access that triggered the mismatch), reads the register
values (x0 contains the guilty address) and reports the bug.
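A hedged sketch of such a handler, assuming the immediate encoding used in
this series (recover and write bits plus a size field in the low bits):
  #define KASAN_ESR_RECOVER   0x20
  #define KASAN_ESR_WRITE     0x10
  #define KASAN_ESR_SIZE_MASK 0x0f
  #define KASAN_ESR_SIZE(esr) (1 << ((esr) & KASAN_ESR_SIZE_MASK))
  static int kasan_handler(struct pt_regs *regs, unsigned int esr)
  {
          bool recover = esr & KASAN_ESR_RECOVER;
          bool write = esr & KASAN_ESR_WRITE;
          size_t size = KASAN_ESR_SIZE(esr);
          u64 addr = regs->regs[0]; /* x0 contains the guilty address */
          kasan_report(addr, size, write, regs->pc);
          /* only resume if the compiler marked the check recoverable */
          if (!recover)
                  die("Oops - KASAN", regs, 0);
          /* skip over the brk instruction and continue execution */
          arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
          return DBG_HOOK_HANDLED;
  }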
Link: http://lkml.kernel.org/r/e825441eda1dbbbb7f583f826a66c94e6f88316a.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:24 +0000 (11:13 +1100)]
kasan: add hooks implementation for tag-based mode
This commit adds the tag-based KASAN specific hook implementations and
adjusts the hooks common to generic and tag-based KASAN.
1. When a new slab cache is created, tag-based KASAN rounds up the size of
the objects in this cache to KASAN_SHADOW_SCALE_SIZE (== 16).
2. On each kmalloc tag-based KASAN generates a random tag, sets the shadow
memory that corresponds to this object to this tag, and embeds this
tag value into the top byte of the returned pointer.
3. On each kfree tag-based KASAN poisons the shadow memory with a random
tag to allow detection of use-after-free bugs.
The rest of the logic of the hook implementation is very much similar to
the one provided by generic KASAN. Tag-based KASAN saves allocation and
free stack metadata to the slab object the same way generic KASAN does.
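As an illustration of point 2 above, a hedged sketch of the kmalloc-side
hook, simplified to always use a random tag (the real hook also has to
handle preassigned tags for caches with constructors, covered later in
this series):
  void *kasan_kmalloc(struct kmem_cache *cache, const void *object,
                      size_t size, gfp_t flags)
  {
          unsigned long redzone_start, redzone_end;
          u8 tag;
          if (unlikely(object == NULL))
                  return NULL;
          redzone_start = round_up((unsigned long)(object + size),
                                   KASAN_SHADOW_SCALE_SIZE);
          redzone_end = round_up((unsigned long)object + cache->object_size,
                                 KASAN_SHADOW_SCALE_SIZE);
          tag = random_tag();
          /* mark the object's shadow with the tag, the rest as redzone */
          kasan_unpoison_shadow(set_tag(object, tag), size);
          kasan_poison_shadow((void *)redzone_start,
                              redzone_end - redzone_start,
                              KASAN_KMALLOC_REDZONE);
          /* embed the same tag into the returned pointer's top byte */
          return set_tag(object, tag);
  }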
Link: http://lkml.kernel.org/r/b10d44bace6a7e9279b9b5a5b4c2a9c4c58cbf4f.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
mm: move obj_to_index to include/linux/slab_def.h
While with SLUB we can actually preassign tags for caches with constructors
and store them in pointers in the freelist, SLAB doesn't allow that since
the freelist is stored as an array of indexes, so there are no pointers to
store the tags.
Instead we compute the tag twice, once when a slab is created before
calling the constructor and then again each time when an object is
allocated with kmalloc. The tag is computed simply by taking the lowest
byte of the index that corresponds to the object. However, in
kasan_kmalloc we only have access to the object's pointer, so we need a
way to find out which index this object corresponds to.
This patch moves obj_to_index from slab.c to include/linux/slab_def.h to
be reused by KASAN.
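For reference, the moved helper looks roughly like this (as defined for
SLAB, using the precomputed reciprocal of the buffer size to avoid a
division):
  static inline unsigned int obj_to_index(const struct kmem_cache *cache,
                                          const struct page *page, void *obj)
  {
          u32 offset = (obj - page->s_mem);
          return reciprocal_divide(offset, cache->reciprocal_buffer_size);
  }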
Link: http://lkml.kernel.org/r/b68796c554fba66d5285274ea6356e642e18a9e5.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan: add bug reporting routines for tag-based mode
This commit adds routines that print tag-based KASAN error reports.
These are quite similar to the generic KASAN ones; the differences are:
1. The way tag-based KASAN finds the first bad shadow cell (with a
mismatching tag). Tag-based KASAN compares memory tags from the shadow
memory to the pointer tag.
2. Tag-based KASAN reports all bugs with the "KASAN: invalid-access"
header.
Also simplify generic KASAN find_first_bad_addr.
Link: http://lkml.kernel.org/r/996c09ec2c8f11294c106973f3b1a211417fa74e.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan: split out generic_report.c from report.c
This patch moves generic KASAN specific error reporting routines to
generic_report.c without any functional changes, leaving common error
reporting code in report.c to be later reused by tag-based KASAN.
Link: http://lkml.kernel.org/r/9030fe246a786be1348f8b08089f30e52be23ec4.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan, mm: perform untagged pointers comparison in krealloc
The krealloc function checks whether the same buffer was reused or a new
one allocated by comparing kernel pointers. Tag-based KASAN changes the
memory tag on the krealloc'ed chunk of memory and therefore also changes
the tag of the returned pointer. Therefore we need to perform the
comparison on
untagged (with tags reset) pointers to check whether it's the same memory
region or not.
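A sketch of the fixed check, assuming the kasan_reset_tag() helper added
earlier in this series (a no-op for generic KASAN):
  void *krealloc(const void *p, size_t new_size, gfp_t flags)
  {
          void *ret;
          if (unlikely(!new_size)) {
                  kfree(p);
                  return ZERO_SIZE_PTR;
          }
          ret = __do_krealloc(p, new_size, flags);
          /* compare untagged pointers: the tag differs even when the
             same memory region was reused */
          if (ret && kasan_reset_tag(p) != kasan_reset_tag(ret))
                  kfree(p);
          return ret;
  }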
Link: http://lkml.kernel.org/r/5045db8a8e249a1eda3199b952120035eacb3bd4.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:23 +0000 (11:13 +1100)]
kasan, arm64: enable top byte ignore for the kernel
Tag-based KASAN uses the Top Byte Ignore feature of arm64 CPUs to store a
pointer tag in the top byte of each pointer. This commit enables the
TCR_TBI1 bit, which enables Top Byte Ignore for the kernel, when tag-based
KASAN is used.
Link: http://lkml.kernel.org/r/1ed03d53ee679cba52ba7118d2acbef948d21fcc.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan, arm64: fix up fault handling logic
Right now arm64 fault handling code removes pointer tags from addresses
covered by TTBR0 in faults taken from both EL0 and EL1, but doesn't do
that for pointers covered by TTBR1.
This patch adds two helper functions is_ttbr0_addr() and is_ttbr1_addr(),
where the latter one accounts for the fact that TTBR1 pointers might be
tagged when tag-based KASAN is in use, and uses these helper functions to
perform pointer checks in arch/arm64/mm/fault.c.
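The helpers are roughly as follows (VA_START marks the start of the TTBR1
range, and arch_kasan_reset_tag() is a no-op without CONFIG_KASAN_SW_TAGS):
  static inline bool is_ttbr0_addr(unsigned long addr)
  {
          /* entry assembly clears tags for TTBR0 addrs */
          return addr < TASK_SIZE;
  }
  static inline bool is_ttbr1_addr(unsigned long addr)
  {
          /* TTBR1 addresses may have a tag if KASAN_SW_TAGS is in use */
          return arch_kasan_reset_tag(addr) >= VA_START;
  }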
Link: http://lkml.kernel.org/r/a54fe8c8c11948b0ac8c8b285fb36f845217c84a.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan: preassign tags to objects with ctors or SLAB_TYPESAFE_BY_RCU
An object constructor can initialize pointers within the object based on
the address of the object. Since the object address might be tagged, we
need to assign a tag before calling the constructor.
The implemented approach is to assign tags to objects with constructors
when a slab is allocated and call constructors once as usual. The
downside is that such objects would always have the same tag when
reallocated, so we won't catch use-after-frees on them.
Also preassign tags for objects from SLAB_TYPESAFE_BY_RCU caches, since
they can be validly accessed after having been freed.
Link: http://lkml.kernel.org/r/b2c17b6674f1737f981ffa6dca7fdfc059a88435.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan, arm64: untag address in _virt_addr_is_linear
virt_addr_is_linear (which is used by virt_addr_valid) assumes that the
top byte of the address is 0xff, which isn't always the case with
tag-based KASAN.
This patch resets the tag in this macro.
Link: http://lkml.kernel.org/r/dd9cda296c70ca6b1839cf4de3ee3137cf5030e7.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
kasan: add tag related helper functions
This commit adds a few helper functions that are meant to be used to
work with tags embedded in the top byte of kernel pointers: to set, to
get or to reset the top byte.
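A hedged sketch of what such helpers can look like, assuming the tag lives
in bits 56-63 of a 64-bit pointer and 0xff is the native tag:
  #define KASAN_TAG_KERNEL 0xFF /* native kernel pointers tag */
  #define KASAN_TAG_SHIFT  56   /* tag occupies the top byte */
  static inline void *set_tag(const void *addr, u8 tag)
  {
          return (void *)(((u64)addr & ~(0xFFUL << KASAN_TAG_SHIFT)) |
                          ((u64)tag << KASAN_TAG_SHIFT));
  }
  static inline u8 get_tag(const void *addr)
  {
          return (u8)((u64)addr >> KASAN_TAG_SHIFT);
  }
  static inline void *reset_tag(const void *addr)
  {
          return set_tag(addr, KASAN_TAG_KERNEL);
  }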
Andrey Konovalov [Wed, 5 Dec 2018 00:13:22 +0000 (11:13 +1100)]
arm64: move untagged_addr macro from uaccess.h to memory.h
Move the untagged_addr() macro from arch/arm64/include/asm/uaccess.h
to arch/arm64/include/asm/memory.h to be later reused by KASAN.
Also make the untagged_addr() macro accept all kinds of address types
(void *, unsigned long, etc.), so that type casts don't have to be
specified at each place where the macro is used. This is done by using
__typeof__.
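With __typeof__ the macro can return the same type it was given; a sketch
of that shape:
  /* sign-extend from bit 55: the top byte becomes 0x00 for user
     addresses and 0xff for kernel addresses, removing any tag */
  #define untagged_addr(addr) \
          ((__typeof__(addr))sign_extend64((u64)(addr), 55))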
Andrey Konovalov [Wed, 5 Dec 2018 00:13:21 +0000 (11:13 +1100)]
kasan: initialize shadow to 0xff for tag-based mode
A tag-based KASAN shadow memory cell contains a memory tag that
corresponds to the tag in the top byte of the pointer that points to that
memory. The native top byte value of kernel pointers is 0xff, so with
tag-based KASAN we need to initialize shadow memory to 0xff.
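A sketch of how the mode-dependent fill value can be expressed (this
series names it KASAN_SHADOW_INIT):
  #ifdef CONFIG_KASAN_GENERIC
  #define KASAN_SHADOW_INIT 0    /* generic mode: 0 means accessible */
  #else
  #define KASAN_SHADOW_INIT 0xFF /* tag-based mode: native, match-all tag */
  #endif
  /* e.g. when setting up the early shadow page: */
  memset(kasan_zero_page, KASAN_SHADOW_INIT, PAGE_SIZE);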
Link: http://lkml.kernel.org/r/9004fd16d56d8772775cf671a8fa66e54ed138dd.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:21 +0000 (11:13 +1100)]
kasan: rename kasan_zero_page to kasan_early_shadow_page
With the tag-based KASAN mode the early shadow value is 0xff and not 0x00,
so this patch renames kasan_zero_(page|pte|pmd|pud|p4d) to
kasan_early_shadow_(page|pte|pmd|pud|p4d) to avoid confusion.
Link: http://lkml.kernel.org/r/1aa5a8e562380e0bfb1d58c8ec3130436cff8b47.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:21 +0000 (11:13 +1100)]
kasan: add CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS
This commit splits the current CONFIG_KASAN config option into two:
1. CONFIG_KASAN_GENERIC, that enables the generic KASAN mode (the one
that exists now);
2. CONFIG_KASAN_SW_TAGS, that enables the software tag-based KASAN mode.
The name CONFIG_KASAN_SW_TAGS is chosen as in the future we will have
another hardware tag-based KASAN mode, that will rely on hardware memory
tagging support in arm64.
With CONFIG_KASAN_SW_TAGS enabled, compiler options are changed to
instrument kernel files with -fsanitize=kernel-hwaddress (except the ones
for which KASAN_SANITIZE := n is set).
Both CONFIG_KASAN_GENERIC and CONFIG_KASAN_SW_TAGS support both
CONFIG_KASAN_INLINE and CONFIG_KASAN_OUTLINE instrumentation modes.
This commit also adds an empty placeholder (for now) implementation of
tag-based KASAN specific hooks inserted by the compiler and adjusts
common hooks implementation.
While this commit adds the CONFIG_KASAN_SW_TAGS config option, this option
is not selectable, as it depends on HAVE_ARCH_KASAN_SW_TAGS, which we will
enable once all the infrastructure code has been added.
Link: http://lkml.kernel.org/r/20728567aae93b5eb88a6636c94c1af73db7cdbc.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:20 +0000 (11:13 +1100)]
kasan: rename source files to reflect the new naming scheme
We now have two KASAN modes: generic KASAN and tag-based KASAN. Rename
kasan.c to generic.c to reflect that. Also rename kasan_init.c to init.c
as it contains initialization code for both KASAN modes.
Link: http://lkml.kernel.org/r/a6cd5acdc334d532754279a7d2d770e2be96419d.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:20 +0000 (11:13 +1100)]
kasan, slub: handle pointer tags in early_kmem_cache_node_alloc()
The previous patch updated KASAN hooks signatures and their usage in SLAB
and SLUB code, except for the early_kmem_cache_node_alloc function. This
patch handles that function separately, as it requires reordering some of
the initialization code to correctly propagate a tagged pointer in case a
tag is assigned by kasan_kmalloc.
Link: http://lkml.kernel.org/r/3459b9f5d4daf96668d9579da2cf15eb5523284b.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrey Konovalov [Wed, 5 Dec 2018 00:13:20 +0000 (11:13 +1100)]
kasan, mm: change hooks signatures
Patch series "kasan: add software tag-based mode for arm64", v12.
This patchset adds a new software tag-based mode to KASAN [1].
(Initially this mode was called KHWASAN, but it got renamed,
see the naming rationale at the end of this section).
The plan is to implement HWASan [2] for the kernel, with the expectation
that it will have performance comparable to KASAN while at the same time
consuming much less memory, trading that off for somewhat imprecise bug
detection and arm64-only support.
The underlying ideas of the approach used by software tag-based KASAN are:
1. By using the Top Byte Ignore (TBI) arm64 CPU feature, we can store
pointer tags in the top byte of each kernel pointer.
2. Using shadow memory, we can store memory tags for each chunk of kernel
memory.
3. On each memory allocation, we can generate a random tag, embed it into
the returned pointer and set the memory tags that correspond to this
chunk of memory to the same value.
4. By using compiler instrumentation, before each memory access we can add
a check that the pointer tag matches the tag of the memory that is being
accessed.
5. On a tag mismatch we report an error (a sketch of such a check is
given below).
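As an illustration of points 3-5, a hedged sketch of the out-of-line check
that the compiler instrumentation calls before an access (names follow
this series loosely; 0xff is the native, unchecked tag):
  void check_memory_region(unsigned long addr, size_t size, bool write,
                           unsigned long ret_ip)
  {
          u8 *shadow_first, *shadow_last, *shadow;
          void *untagged = reset_tag((const void *)addr);
          u8 tag = get_tag((const void *)addr);
          if (unlikely(size == 0))
                  return;
          /* accesses through 0xff-tagged pointers are not checked */
          if (tag == 0xFF)
                  return;
          shadow_first = kasan_mem_to_shadow(untagged);
          shadow_last = kasan_mem_to_shadow(untagged + size - 1);
          for (shadow = shadow_first; shadow <= shadow_last; shadow++) {
                  if (*shadow != tag) {
                          kasan_report(addr, size, write, ret_ip);
                          return;
                  }
          }
  }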
With this patchset the existing KASAN mode gets renamed to generic KASAN,
with the word "generic" meaning that the implementation can be supported
by any architecture as it is purely software.
The new mode this patchset adds is called software tag-based KASAN. The
word "tag-based" refers to the fact that this mode uses tags embedded into
the top byte of kernel pointers and the TBI arm64 CPU feature that allows
such pointers to be dereferenced. The word "software" here means that shadow
memory manipulation and tag checking on pointer dereference is done in
software. As it is the only tag-based implementation right now, "software
tag-based" KASAN is sometimes referred to as simply "tag-based" in this
patchset.
A potential expansion of this mode is a hardware tag-based mode, which would
use hardware memory tagging support (announced by Arm [3]) instead of
compiler instrumentation and manual shadow memory manipulation.
Same as generic KASAN, software tag-based KASAN is strictly a debugging
feature.
On mobile devices generic KASAN's memory usage is a significant problem. One
of the main reasons to have tag-based KASAN is to be able to perform a
similar set of checks as the generic one does, but with lower memory
requirements.
Comment from Vishwath Mohan <vishwath@google.com>:
I don't have data on-hand, but anecdotally both ASAN and KASAN have proven
problematic to enable for environments that don't tolerate the increased
memory pressure well. This includes:
(a) Low-memory form factors - Wear, TV, Things, lower-tier phones like Go,
(c) Connected components like Pixel's visual core [1].
These are both places I'd love to have a low(er) memory footprint option at
my disposal.
Comment from Evgenii Stepanov <eugenis@google.com>:
Looking at a live Android device under load, slab (according to
/proc/meminfo) + kernel stack take 8-10% available RAM (~350MB). KASAN's
overhead of 2x - 3x on top of it is not insignificant.
Not having this overhead enables near-production use - ex. running
KASAN/KHWASAN kernel on a personal, daily-use device to catch bugs that do
not reproduce in test configuration. These are the ones that often cost
the most engineering time to track down.
CPU overhead is bad, but generally tolerable. RAM is critical, in our
experience. Once it gets low enough, OOM-killer makes your life miserable.
Software tag-based KASAN mode is implemented in a very similar way to the
generic one. This patchset essentially does the following:
1. TCR_TBI1 is set to enable Top Byte Ignore.
2. Shadow memory is used (with a different scale, 1:16, so each shadow
byte corresponds to 16 bytes of kernel memory) to store memory tags.
3. All slab objects are aligned to the shadow scale, which is 16 bytes.
4. All pointers returned from the slab allocator are tagged with a random
tag and the corresponding shadow memory is poisoned with the same value.
5. Compiler instrumentation is used to insert tag checks. Either by
calling callbacks or by inlining them (CONFIG_KASAN_OUTLINE and
CONFIG_KASAN_INLINE flags are reused).
6. When a tag mismatch is detected in callback instrumentation mode
KASAN simply prints a bug report. In case of inline instrumentation,
clang inserts a brk instruction, and KASAN has its own brk handler,
which reports the bug.
7. The memory in between slab objects is marked with a reserved tag, and
acts as a redzone.
8. When a slab object is freed it's marked with a reserved tag.
Bug detection is imprecise for two reasons:
1. We won't catch some small out-of-bounds accesses that fall into the
same shadow cell as the last byte of a slab object.
2. We only have 1 byte to store tags, which means we have a 1/256
probability of a tag match for an incorrect access (actually even
slightly less due to reserved tag values).
Despite that, there's a particular type of bug that tag-based KASAN can
detect that generic KASAN cannot: a use-after-free after the object has
been reallocated by someone else.
====== Testing
Some kernel developers voiced a concern that changing the top byte of
kernel pointers may lead to subtle bugs that are difficult to discover.
To address this concern, deliberate testing has been performed.
It doesn't seem feasible to do some kind of static checking to find
potential issues with pointer tagging, so a dynamic approach was taken.
All pointer comparisons/subtractions were instrumented in an LLVM compiler
pass, and a kernel module was used that would print a bug report whenever
two pointers with different tags are compared/subtracted (ignoring
comparisons with NULL pointers and with pointers obtained by casting an
error code to a pointer type). The kernel was then booted in QEMU and on
an Odroid C2 board, and syzkaller was run.
This yielded the following results.
The two places that look interesting are:
is_vmalloc_addr in include/linux/mm.h
is_kernel_rodata in mm/util.c
Here we compare a pointer with some fixed untagged values to make sure
that the pointer lies in a particular part of the kernel address space.
Since tag-based KASAN doesn't add tags to pointers that belong to rodata
or vmalloc regions, this should work as is. To make sure, debug checks
have been added to those two functions to verify that the result doesn't
change whether we operate on pointers with or without untagging.
A few other cases that don't look that interesting:
Comparing pointers to achieve a unique sorting order of pointee objects
(e.g. sorting lock addresses before performing a double lock):
tty_ldisc_lock_pair_timeout in drivers/tty/tty_ldisc.c
pipe_double_lock in fs/pipe.c
unix_state_double_lock in net/unix/af_unix.c
lock_two_nondirectories in fs/inode.c
mutex_lock_double in kernel/events/core.c
ep_cmp_ffd in fs/eventpoll.c
fsnotify_compare_groups in fs/notify/mark.c
Nothing needs to be done here, since the tags embedded into pointers
don't change, so the sorting order would still be unique.
Checks that a pointer belongs to some particular allocation:
is_sibling_entry in lib/radix-tree.c
object_is_on_stack in include/linux/sched/task_stack.h
Nothing needs to be done here either, since two pointers can only belong
to the same allocation if they have the same tag.
Overall, since the kernel boots and works, there are no critical bugs.
As for the rest, the traditional kernel testing approach (use until it
fails) is the only one that looks feasible.
Another point here is that tag-based KASAN is available under a separate
config option that needs to be deliberately enabled. Even though it might
be used in a "near-production" environment to find bugs that are not found
during fuzzing or running tests, it is still a debug tool.
====== Benchmarks
The following numbers were collected on an Odroid C2 board. Both generic and
tag-based KASAN were used in inline instrumentation mode.
Boot time [1]:
* ~1.7 sec for clean kernel
* ~5.0 sec for generic KASAN
* ~5.0 sec for tag-based KASAN
Network performance [2]:
* 8.33 Gbits/sec for clean kernel
* 3.17 Gbits/sec for generic KASAN
* 2.85 Gbits/sec for tag-based KASAN
Slab memory usage after boot [3]:
* ~40 kb for clean kernel
* ~105 kb (~260% overhead) for generic KASAN
* ~47 kb (~20% overhead) for tag-based KASAN
KASAN memory overhead consists of three main parts:
1. Increased slab memory usage due to redzones.
2. Shadow memory (the whole of which is reserved once during boot).
3. Quarantine (grows gradually up to some preset limit; the larger the
limit, the greater the chance to detect a use-after-free).
Comparing tag-based vs generic KASAN for each of these points:
1. 20% vs 260% overhead.
2. 1/16th vs 1/8th of physical memory.
3. Tag-based KASAN doesn't require quarantine.
[1] Time before the ext4 driver is initialized.
[2] Measured as `iperf -s & iperf -c 127.0.0.1 -t 30`.
[3] Measured as `cat /proc/meminfo | grep Slab`.
====== Some notes
A few notes:
1. The patchset can be found here:
https://github.com/xairy/kasan-prototype/tree/khwasan
2. Building requires a recent Clang version (7.0.0 or later).
3. Stack instrumentation is not supported yet and will be added later.
This patch (of 25):
Tag-based KASAN changes the value of the top byte of pointers returned
from the kernel allocation functions (such as kmalloc). This patch
updates KASAN hooks signatures and their usage in SLAB and SLUB code to
reflect that.
Link: http://lkml.kernel.org/r/10068968c5dbdd1913bd60f89262e3c1a50f3d38.1543337629.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
VM_DATA_DEFAULT_FLAGS uses READ_IMPLIES_EXEC, so page.h should include
personality.h to provide this.
This fixes no known bugs and can be safely ignored ;)
Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Piotr Jaroszynski [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
fs/iomap.c: get/put the page in iomap_page_create/release()
migrate_page_move_mapping() expects pages with private data set to have a
page_count elevated by 1. This is what used to happen for xfs through the
buffer_heads code before the switch to iomap in commit 82cb14175e7d ("xfs:
add support for sub-pagesize writeback without buffer_heads"). Not having
the count elevated causes move_pages() to fail on memory mapped files
coming from xfs.
Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it in
iomap_page_release().
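A sketch of the resulting pairing, close to the 4.20-era fs/iomap.c (the
uptodate bitmap setup is incidental here):
  static struct iomap_page *
  iomap_page_create(struct inode *inode, struct page *page)
  {
          struct iomap_page *iop = to_iomap_page(page);
          if (iop || i_blocksize(inode) == PAGE_SIZE)
                  return iop;
          iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
          atomic_set(&iop->read_count, 0);
          atomic_set(&iop->write_count, 0);
          bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
          /* migrate_page_move_mapping() assumes that pages with private
             data have their count elevated by 1 */
          get_page(page);
          set_page_private(page, (unsigned long)iop);
          SetPagePrivate(page);
          return iop;
  }
  static void
  iomap_page_release(struct page *page)
  {
          struct iomap_page *iop = to_iomap_page(page);
          if (!iop)
                  return;
          ClearPagePrivate(page);
          set_page_private(page, 0);
          put_page(page);
          kfree(iop);
  }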
Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads") Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Brian Foster <bfoster@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yongkai Wu [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
hugetlbfs: call VM_BUG_ON_PAGE earlier in free_huge_page()
A stack trace was triggered by VM_BUG_ON_PAGE(page_mapcount(page), page)
in free_huge_page(). Unfortunately, the page->mapping field was set to
NULL before this test. This made it more difficult to determine the root
cause of the problem.
Move the VM_BUG_ON_PAGE tests earlier in the function so that if they do
trigger, more information is present in the page struct.
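A hedged sketch of the reordering (only the relevant lines of
free_huge_page() are shown):
  void free_huge_page(struct page *page)
  {
          /* check while page->mapping is still intact, so a triggered
             BUG dumps useful state from the page struct */
          VM_BUG_ON_PAGE(page_count(page), page);
          VM_BUG_ON_PAGE(page_mapcount(page), page);
          set_page_private(page, 0);
          page->mapping = NULL;
          /* ... the rest of the teardown is unchanged ... */
  }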
Link: http://lkml.kernel.org/r/1543491843-23438-1-git-send-email-nic_w@163.com Signed-off-by: Yongkai Wu <nic_w@163.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yueyi Li [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
memblock: annotate memblock_is_reserved() with __init_memblock
Found warning:
WARNING: EXPORT symbol "gsi_write_channel_scratch" [vmlinux] version generation failed, symbol will not be versioned.
WARNING: vmlinux.o(.text+0x1e0a0): Section mismatch in reference from the function valid_phys_addr_range() to the function .init.text:memblock_is_reserved()
The function valid_phys_addr_range() references
the function __init memblock_is_reserved().
This is often because valid_phys_addr_range lacks a __init
annotation or the annotation of memblock_is_reserved is wrong.
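The fix is a one-line annotation; the function then reads roughly as
follows (__init_memblock expands to __meminit, or to nothing, depending on
CONFIG_ARCH_DISCARD_MEMBLOCK, which keeps the section references
consistent):
  bool __init_memblock memblock_is_reserved(phys_addr_t addr)
  {
          return memblock_search(&memblock.reserved, addr) != -1;
  }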
Baruch Siach [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
psi: fix reference to kernel commandline enable
The kernel commandline parameter named in CONFIG_PSI_DEFAULT_DISABLED
help text contradicts the documentation in kernel-parameters.txt, and
the code. Fix that.
Link: http://lkml.kernel.org/r/20181203213416.GA12627@cmpxchg.org Fixes: e0c274472d ("psi: make disabling/enabling easier for vendor kernels") Signed-off-by: Baruch Siach <baruch@tkos.co.il> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jian Wang [Wed, 5 Dec 2018 00:13:19 +0000 (11:13 +1100)]
ocfs2/dlm: return DLM_CANCELGRANT if the lock is on granted list and the operation is canceled
In dlm_move_lockres_to_recovery_list(), if the lock is in the granted
queue and cancel_pending is set, it will encounter a BUG. I think this is
a meaningless BUG, so remove it. A scenario that causes
this BUG will be given below.
At the beginning, Node 1 is the master and holds an NL lock, while Node 2
and Node 3 each hold a PR lock. The sequence of events is as follows:
1. Node 2 wants to get an EX lock.
2. Node 3 wants to get an EX lock.
3. Node 1: Node 3's PR lock hinders Node 2 from getting the EX lock, so
Node 1 sends Node 3 a BAST.
4. Node 3: receives the BAST from Node 1; the downconvert thread begins to
cancel the PR to EX conversion. In the dlmunlock_common function, the
downconvert thread has set lock->cancel_pending, but has not yet entered
the dlm_send_remote_unlock_request function.
5. Node 2 dies because the host is powered down.
6. Node 1: in the recovery process, cleans the locks related to Node 2,
then finishes Node 3's PR to EX request and gives Node 3 an AST.
7. Node 3: receives the AST from Node 1, changes the lock level to EX and
moves the lock to the granted list.
8. Node 1 dies because the host is powered down.
9. Node 3: in the dlm_move_lockres_to_recovery_list function, the lock is
in the granted queue and cancel_pending is set: BUG_ON.
But after clearing this BUG, the process will encounter a second BUG in
the ocfs2_unlock_ast function. Here is a scenario that will cause that
second BUG:
At the beginning, Node 1 is the master and holds an NL lock, while Node 2
and Node 3 each hold a PR lock. The sequence of events is as follows:
1. Node 2 wants to get an EX lock.
2. Node 3 wants to get an EX lock.
3. Node 1: Node 3's PR lock hinders Node 2 from getting the EX lock, so
Node 1 sends Node 3 a BAST.
4. Node 3: receives the BAST from Node 1; the downconvert thread begins to
cancel the PR to EX conversion. In the dlmunlock_common function, the
downconvert thread has released lock->spinlock and res->spinlock, but has
not yet entered the dlm_send_remote_unlock_request function.
5. Node 2 dies because the host is powered down.
6. Node 1: in the recovery process, cleans the locks related to Node 2,
then finishes Node 3's PR to EX request and gives Node 3 an AST.
7. Node 3: receives the AST from Node 1, changes the lock level to EX,
moves the lock to the granted list and sets lockres->l_unlock_action to
OCFS2_UNLOCK_INVALID in the ocfs2_locking_ast function.
8. Node 1 dies because the host is powered down.
9. Node 3: realizes that Node 1 is dead and removes Node 1 from the
domain_map. The downconvert thread gets DLM_NORMAL from the
dlm_send_remote_unlock_request function and sets *call_ast to 1. Then the
downconvert thread meets the BUG in the ocfs2_unlock_ast function.
To avoid meet the second BUG, dlmunlock_common() should return
DLM_CANCELGRANT if the lock is on granted list and the operation is
canceled.
Link: http://lkml.kernel.org/r/98f0e80c-9c13-dbb6-047c-b40e100082b1@huawei.com Signed-off-by: Jian Wang <wangjian161@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jian Wang [Wed, 5 Dec 2018 00:13:18 +0000 (11:13 +1100)]
ocfs2/dlm: clean DLM_LKSB_GET_LVB and DLM_LKSB_PUT_LVB when the cancel_pending is set
dlm_move_lockres_to_recovery_list() should clear DLM_LKSB_GET_LVB and
DLM_LKSB_PUT_LVB when cancel_pending is set. Otherwise a node may panic
in dlm_proxy_ast_handler.
Here is the situation. At the beginning, Node 1 is the master of the lock
resource and holds an NL lock, Node 2 holds a PR lock, Node 3 holds a PR
lock and Node 4 holds an NL lock. The sequence of events is as follows:
1. Node 2: converts lock_2 from PR to EX.
2. Node 1: the mode of lock_3 is PR, which blocks the conversion request
of Node 2; moves lock_2 to the conversion list.
3. Node 3: converts lock_3 from PR to EX.
4. Node 1: moves lock_3 to the conversion list and sends a BAST to Node 3.
5. Node 3: receives the BAST from Node 1; the downconvert thread executes
the cancel convert operation.
6. Node 1 dies because the host is powered down.
7. Node 3: in the dlmunlock_common function, the downconvert thread sets
cancel_pending. At the same time, Node 3 realizes that Node 1 is dead, so
it moves lock_3 back to the granted list in the
dlm_move_lockres_to_recovery_list function and removes Node 1 from the
domain_map in the __dlm_hb_node_down function. The downconvert thread then
fails to send the lock cancellation request to Node 1 and returns
DLM_NORMAL from the dlm_send_remote_unlock_request function.
8. Node 4: becomes the recovery master.
9. Node 2: during the recovery process, sends lock_2, which is converting
from PR to EX, to Node 4.
10. Node 3: during the recovery process, sends lock_3, which is on the
granted list and contains the DLM_LKSB_GET_LVB flag, to Node 4. Then the
downconvert thread deletes the DLM_LKSB_GET_LVB flag in the
dlmunlock_common function.
11. Node 4: finishes recovery. The mode of lock_3 is PR, which blocks the
conversion request of Node 2, so Node 4 sends a BAST to Node 3.
12. Node 3: receives the BAST from Node 4 and converts lock_3 from PR to
NL.
13. Node 4: changes the mode of lock_3 from PR to NL and sends a message
to Node 3.
14. Node 3: receives the message from Node 4. The message contains the
LKM_GET_LVB flag, but lock->lksb->flags does not contain DLM_LKSB_GET_LVB:
BUG_ON in the dlm_proxy_ast_handler function.
Link: http://lkml.kernel.org/r/bffbe5a5-acb6-652d-eb8a-99fb051e6631@huawei.com Signed-off-by: Jian Wang <wangjian161@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Kravetz [Wed, 5 Dec 2018 00:13:18 +0000 (11:13 +1100)]
hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
hugetlbfs page faults can race with truncate and hole punch operations.
Current code in the page fault path attempts to handle this by 'backing
out' operations if we encounter the race. One obvious omission in the
current code is removing a page newly added to the page cache. This is
pretty straightforward to address, but there is a more subtle and
difficult issue of backing out hugetlb reservations. To handle this
correctly, the 'reservation state' before page allocation needs to be
noted so that it can be properly backed out. There are four distinct
possibilities for reservation state: shared/reserved, shared/no-resv,
private/reserved and private/no-resv. Backing out a reservation may
require memory allocation which could fail so that needs to be taken into
account as well.
Instead of writing the required complicated code for this rare occurrence,
just eliminate the race. i_mmap_rwsem is now held in read mode for the
duration of page fault processing. Hold i_mmap_rwsem longer in truncation
and hole punch code to cover the call to remove_inode_hugepages.
Link: http://lkml.kernel.org/r/20181203200850.6460-3-mike.kravetz@oracle.com Fixes: ebed4bfc8da8 ("hugetlb: fix absurd HugePages_Rsvd") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Kravetz [Wed, 5 Dec 2018 00:13:18 +0000 (11:13 +1100)]
hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
Patch series "hugetlbfs: use i_mmap_rwsem for better synchronization".
These patches are a follow up to the RFC,
http://lkml.kernel.org/r/20181024045053.1467-1-mike.kravetz@oracle.com
Comments made by Naoya were addressed.
There are two primary issues addressed here:
1) For shared pmds, huge pte pointers returned by huge_pte_alloc can
become invalid via a call to huge_pmd_unshare by another thread.
2) hugetlbfs page faults can race with truncation causing invalid
global reserve counts and state.
Both issues are addressed by expanding the use of i_mmap_rwsem.
These issues have existed for a long time. They can be recreated with a
test program that causes page fault/truncation races. For simple
mappings, this results in a negative HugePages_Rsvd count. If racing with
mappings that contain shared pmds, we can hit "BUG at
fs/hugetlbfs/inode.c:444!" or Oops! as the result of an invalid memory
reference.
I broke up the larger RFC into separate patches addressing each issue.
Hopefully, this is easier to understand/review.
This patch (of 3):
While looking at BUGs associated with invalid huge page map counts, it was
discovered and observed that a huge pte pointer could become 'invalid' and
point to another task's page table. Consider the following:
A task takes a page fault on a shared hugetlbfs file and calls
huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
shared pmd.
Now, another task truncates the hugetlbfs file. As part of truncation, it
unmaps everyone who has the file mapped. If the range being truncated is
covered by a shared pmd, huge_pmd_unshare will be called. For all but the
last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
to the pmd. If the task in the middle of the page fault is not the last
user, the ptep returned by huge_pte_alloc now points to another task's
page table or worse. This leads to bad things such as incorrect page
map/reference counts or invalid memory references.
To fix, expand the use of i_mmap_rwsem as follows:
- i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
huge_pmd_share is only called via huge_pte_alloc, so callers of
huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
of huge_pte_alloc continue to hold the semaphore until finished with the
ptep.
- i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called
(a sketch of the resulting locking pattern is given below).
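A hedged sketch of that pattern (fault-path details elided; h is the
hstate of the vma):
  /* page fault path: hold i_mmap_rwsem in read mode across the ptep use */
  mapping = vma->vm_file->f_mapping;
  i_mmap_lock_read(mapping);
  ptep = huge_pte_alloc(mm, address, huge_page_size(h));
  if (!ptep) {
          i_mmap_unlock_read(mapping);
          return VM_FAULT_OOM;
  }
  /* ... handle the fault using ptep ... */
  i_mmap_unlock_read(mapping);
  /* truncation / unmap path: take the semaphore in write mode instead */
  i_mmap_lock_write(mapping);
  /* ... unmapping and huge_pmd_unshare() run under the write lock ... */
  i_mmap_unlock_write(mapping);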
Link: http://lkml.kernel.org/r/20181203200850.6460-2-mike.kravetz@oracle.com Fixes: 39dde65c9940 ("shared page table for hugetlb page") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Rientjes [Wed, 5 Dec 2018 00:13:17 +0000 (11:13 +1100)]
mm, thp: always specify disabled vmas as nh in smaps
1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") introduced
a regression in that userspace cannot always determine the set of vmas
where thp is disabled.
Userspace relies on the "nh" flag being emitted as part of /proc/pid/smaps
to determine if a vma has been disabled from being backed by hugepages.
Previous to this commit, prctl(PR_SET_THP_DISABLE, 1) would cause thp to
be disabled and emit "nh" as a flag for the corresponding vmas as part of
/proc/pid/smaps. After the commit, thp is disabled by means of an mm flag
and "nh" is not emitted.
This causes smaps parsing libraries to assume a vma is enabled for thp,
and ends up puzzling the user as to why its memory is not backed by thp.
This also clears the "hg" flag to make the behavior of MADV_HUGEPAGE and
PR_SET_THP_DISABLE definitive.
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1809251449060.96762@chino.kir.corp.google.com Fixes: 1860033237d4 ("mm: make PR_SET_THP_DISABLE immediately active") Signed-off-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>