]> www.infradead.org Git - users/jedix/linux-maple.git/log
users/jedix/linux-maple.git
3 years agomm/mmap: Use advanced maple tree API for mmap_region()
Liam R. Howlett [Tue, 10 Nov 2020 18:37:40 +0000 (13:37 -0500)]
mm/mmap: Use advanced maple tree API for mmap_region()

Changing mmap_region() to use the maple tree state and the advanced
maple tree interface allows for a lot less tree walking.

This change removes the last caller of munmap_vma_range(), so drop this
unused function.

Add vma_expand() to expand a VMA if possible by doing the necessary
hugepage check, uprobe_munmap of files, dcache flush, modifications then
undoing the detaches, etc.

Add vma_mas_link() helper to add a VMA to the linked list and maple tree
until the linked list is removed.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Use maple tree operations for find_vma_intersection() and find_vma()
Liam R. Howlett [Mon, 28 Sep 2020 19:50:19 +0000 (15:50 -0400)]
mm: Use maple tree operations for find_vma_intersection() and find_vma()

Move find_vma_intersection() to mmap.c and change implementation to
maple tree.

When searching for a vma within a range, it is easier to use the maple
tree interface.  This means the find_vma() call changes to a special
case of the find_vma_intersection().

Exported for kvm module.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Change do_brk_flags() to expand existing VMA and add
Liam R. Howlett [Mon, 21 Sep 2020 14:47:34 +0000 (10:47 -0400)]
mm/mmap: Change do_brk_flags() to expand existing VMA and add
do_brk_munmap()

Avoid allocating a new VMA when it a vma modification can occur.  When a
brk() can expand or contract a VMA, then the single store operation will
only modify one index of the maple tree instead of causing a node to
split or coalesce.  This avoids unnecessary allocations/frees of maple
tree nodes and VMAs.

Use the advanced API for the maple tree to avoid unnecessary walks of
the tree.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/khugepaged: Optimize collapse_pte_mapped_thp() by using vma_lookup()
Liam R. Howlett [Thu, 8 Apr 2021 20:21:58 +0000 (16:21 -0400)]
mm/khugepaged: Optimize collapse_pte_mapped_thp() by using vma_lookup()

vma_lookup() will walk the vma tree once and not continue to look for
the next vma.  Since the exact vma is checked below, this is a more
optimal way of searching.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Optimize find_exact_vma() to use vma_lookup()
Liam R. Howlett [Thu, 8 Apr 2021 20:15:00 +0000 (16:15 -0400)]
mm: Optimize find_exact_vma() to use vma_lookup()

Use vma_lookup() to walk the tree to the start value requested.  If
the vma at the start does not match, then the answer is NULL and there
is no need to look at the next vma the way that find_vma() would.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoxen/privcmd: Optimized privcmd_ioctl_mmap() by using vma_lookup()
Liam R. Howlett [Thu, 8 Apr 2021 20:06:36 +0000 (16:06 -0400)]
xen/privcmd: Optimized privcmd_ioctl_mmap() by using vma_lookup()

vma_lookup() walks the VMA tree for a specific value, find_vma() will
search the tree after walking to a specific value.  It is more efficient
to only walk to the requested value as this case requires the address to
equal the vm_start.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Remove rb tree.
Liam R. Howlett [Fri, 24 Jul 2020 19:06:25 +0000 (15:06 -0400)]
mm: Remove rb tree.

Remove the RB tree and start using the maple tree for vm_area_struct
tracking.

Drop validate_mm() calls in expand_upwards() and expand_downwards() as
the lock is not held.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agokernel/fork: Use maple tree for dup_mmap() during forking
Liam R. Howlett [Fri, 24 Jul 2020 17:32:30 +0000 (13:32 -0400)]
kernel/fork: Use maple tree for dup_mmap() during forking

The maple tree was already tracking VMAs in this function by an earlier
commit, but the rbtree iterator was being used to iterate the list.
Change the iterator to use a maple tree native iterator, rcu locking,
and switch to the maple tree advanced API to avoid multiple walks of the
tree during insert operations.

anon_vma_fork() may enter the slow path and cause a schedule() call to
cause rcu issues.  Drop the rcu lock and reacquiring the lock.  There is
no harm in this approach as the mmap_sem is taken for write/read and
held across the schedule() call so the VMAs will not change.

Note that the bulk allocation of nodes is also happening here for
performance reasons.  The node calculations are done internally to the
tree and use the VMA count and assume the worst-case node requirements.
The VM_DONT_COPY flag does not allow for the most efficient copy method
of the tree and so a bulk loading algorithm is used.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Use maple tree for unmapped_area{_topdown}
Liam R. Howlett [Fri, 24 Jul 2020 16:01:47 +0000 (12:01 -0400)]
mm/mmap: Use maple tree for unmapped_area{_topdown}

The maple tree code was added to find the unmapped area in a previous
commit and was checked against what the rbtree returned, but the actual
result was never used.  Start using the maple tree implementation and
remove the rbtree code. Note, the advanced maple tree interface is used
so the rcu locking is needed to be handled here or at a higher level.

Add kernel documentation comment for these functions.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
Liam R. Howlett [Fri, 24 Jul 2020 15:48:08 +0000 (11:48 -0400)]
mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree

Use the maple tree's advanced API and a maple state to walk the tree for
the entry at the address or the next vma, then use the maple state to
walk back one entry to find the previous entry.  Note, the advanced
maple tree interface does not handle the rcu locking.

Add kernel documentation comments for this API.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Use the maple tree in find_vma() instead of the rbtree.
Liam R. Howlett [Fri, 24 Jul 2020 15:38:39 +0000 (11:38 -0400)]
mm/mmap: Use the maple tree in find_vma() instead of the rbtree.

Using the maple tree interface mt_find() will handle the RCU locking and
will start searching at the address up to the limit, ULONG_MAX in this
case.

Add kernel documentation to this API.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Start tracking VMAs with maple tree
Liam R. Howlett [Fri, 24 Jul 2020 17:04:30 +0000 (13:04 -0400)]
mm: Start tracking VMAs with maple tree

Start tracking the VMAs with the new maple tree structure in parallel
with the rb_tree.  Add debug and trace events for maple tree operations
and duplicate the rb_tree that is created on forks into the maple tree.

In this commit, the maple tree is added to the mm_struct including the
mm_init struct, added support in required mm/mmap functions, added
tracking in kernel/fork for process forking, and used to find the
unmapped_area and checked against what the rbtree finds.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoMaple Tree: Add new data structure
Liam R. Howlett [Fri, 24 Jul 2020 17:02:52 +0000 (13:02 -0400)]
Maple Tree: Add new data structure

The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently.  There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct.  The long term goal is to reduce
or remove the mmap_sem contention.

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses.  The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
3 years agoradix tree test suite: Add support for slab bulk APIs
Liam R. Howlett [Tue, 25 Aug 2020 19:51:54 +0000 (15:51 -0400)]
radix tree test suite: Add support for slab bulk APIs

Add support for kmem_cache_free_bulk() and kmem_cache_alloc_bulk() to
the radix tree test suite.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoradix tree test suite: Add allocation counts and size to kmem_cache
Liam R. Howlett [Tue, 1 Jun 2021 14:35:43 +0000 (10:35 -0400)]
radix tree test suite: Add allocation counts and size to kmem_cache

Add functions to get the number of allocations, and total allocations
from a kmem_cache.  Also add a function to get the allocated size and a
way to zero the total allocations.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoradix tree test suite: Add kmem_cache_set_non_kernel()
Liam R. Howlett [Thu, 27 Feb 2020 20:13:20 +0000 (15:13 -0500)]
radix tree test suite: Add kmem_cache_set_non_kernel()

kmem_cache_set_non_kernel() is a mechanism to allow a certain number of
kmem_cache_alloc requests to succeed even when GFP_KERNEL is not set in
the flags.  This functionality allows for testing different paths though
the code.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
3 years agoradix tree test suite: Add pr_err define
Liam R. Howlett [Tue, 1 Jun 2021 14:25:22 +0000 (10:25 -0400)]
radix tree test suite: Add pr_err define

define pr_err to printk

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoremap_file_pages: Use vma_lookup() instead of find_vma()
Liam R. Howlett [Fri, 18 Jun 2021 00:31:19 +0000 (20:31 -0400)]
remap_file_pages: Use vma_lookup() instead of find_vma()

Using vma_lookup() verifies the start address is contained in the found vma.
This results in easier to read code.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agodrm/amdgpu: Use vma_lookup() in amdgpu_ttm_tt_get_user_pages()
Liam R. Howlett [Thu, 8 Apr 2021 17:28:25 +0000 (13:28 -0400)]
drm/amdgpu: Use vma_lookup() in amdgpu_ttm_tt_get_user_pages()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/nommu: unexport do_munmap()
Liam R. Howlett [Fri, 4 Jun 2021 19:28:52 +0000 (15:28 -0400)]
mm/nommu: unexport do_munmap()

do_munmap() does not take the mmap_write_lock().  vm_munmap() should be
used instead.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoAdd linux-next specific files for 20210603
Stephen Rothwell [Thu, 3 Jun 2021 09:12:22 +0000 (19:12 +1000)]
Add linux-next specific files for 20210603

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoMerge branch 'akpm/master'
Stephen Rothwell [Thu, 3 Jun 2021 08:17:29 +0000 (18:17 +1000)]
Merge branch 'akpm/master'

3 years agokdump: use vmlinux_build_id to simplify
Stephen Boyd [Wed, 2 Jun 2021 03:53:30 +0000 (13:53 +1000)]
kdump: use vmlinux_build_id to simplify

We can use the vmlinux_build_id array here now instead of open coding it.
This mostly consolidates code.

Link: https://lkml.kernel.org/r/20210511003845.2429846-14-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agobuildid: fix kernel-doc notation
Stephen Boyd [Wed, 2 Jun 2021 03:53:30 +0000 (13:53 +1000)]
buildid: fix kernel-doc notation

Kernel doc should use "Return:" instead of "Returns" to properly reflect
the return values.

Link: https://lkml.kernel.org/r/20210511003845.2429846-13-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agobuildid: mark some arguments const
Stephen Boyd [Wed, 2 Jun 2021 03:53:30 +0000 (13:53 +1000)]
buildid: mark some arguments const

These arguments are never modified so they can be marked const to indicate
as such.

Link: https://lkml.kernel.org/r/20210511003845.2429846-12-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoscripts/decode_stacktrace.sh: indicate 'auto' can be used for base path
Stephen Boyd [Wed, 2 Jun 2021 03:53:30 +0000 (13:53 +1000)]
scripts/decode_stacktrace.sh: indicate 'auto' can be used for base path

Add "auto" to the usage message so that it's a little clearer that you can
pass "auto" as the second argument.  When passing "auto" the script tries
to find the base path automatically instead of requiring it be passed on
the commandline.  Also use [<variable>] to indicate the variable argument
and that it is optional so that we can differentiate from the literal
"auto" that should be passed.

Link: https://lkml.kernel.org/r/20210511003845.2429846-11-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoscripts/decode_stacktrace.sh: silence stderr messages from addr2line/nm
Stephen Boyd [Wed, 2 Jun 2021 03:53:30 +0000 (13:53 +1000)]
scripts/decode_stacktrace.sh: silence stderr messages from addr2line/nm

Sometimes if you're using tools that have linked things improperly or have
new features/sections that older tools don't expect you'll see warnings
printed to stderr.  We don't really care about these warnings, so let's
just silence these messages to cleanup output of this script.

Link: https://lkml.kernel.org/r/20210511003845.2429846-10-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoscripts/decode_stacktrace.sh: support debuginfod
Stephen Boyd [Wed, 2 Jun 2021 03:53:29 +0000 (13:53 +1000)]
scripts/decode_stacktrace.sh: support debuginfod

Now that stacktraces contain the build ID information we can update this
script to use debuginfod-find to locate the debuginfo for the vmlinux and
modules automatically.  This can replace the existing code that requires
specifying a path to vmlinux or tries to find the vmlinux and modules
automatically by using the release number.  Work it into the script as a
fallback option if the vmlinux isn't specified on the commandline.

Link: https://lkml.kernel.org/r/20210511003845.2429846-9-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agox86/dumpstack: use %pSb/%pBb for backtrace printing
Stephen Boyd [Wed, 2 Jun 2021 03:53:29 +0000 (13:53 +1000)]
x86/dumpstack: use %pSb/%pBb for backtrace printing

Let's use the new printk formats to print the stacktrace entries when
printing a backtrace to the kernel logs.  This will include any module's
build ID[1] in it so that offline/crash debugging can easily locate the
debuginfo for a module via something like debuginfod[2].

Link: https://lkml.kernel.org/r/20210511003845.2429846-8-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId
Link: https://sourceware.org/elfutils/Debuginfod.html
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoarm64: stacktrace: use %pSb for backtrace printing
Stephen Boyd [Wed, 2 Jun 2021 03:53:29 +0000 (13:53 +1000)]
arm64: stacktrace: use %pSb for backtrace printing

Let's use the new printk format to print the stacktrace entry when
printing a backtrace to the kernel logs. This will include any module's
build ID[1] in it so that offline/crash debugging can easily locate the
debuginfo for a module via something like debuginfod[2].

Link: https://lkml.kernel.org/r/20210511003845.2429846-7-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId
Link: https://sourceware.org/elfutils/Debuginfod.html
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomodule: fix build error when CONFIG_SYSFS is disabled
Bixuan Cui [Wed, 2 Jun 2021 03:53:29 +0000 (13:53 +1000)]
module: fix build error when CONFIG_SYSFS is disabled

Fix build error when disable CONFIG_SYSFS:
kernel/module.c:2805:8: error: implicit declaration of function `sect_empty'; did you mean `desc_empty'? [-Werror=implicit-function-declaration]
 2805 |   if (!sect_empty(sechdr) && sechdr->sh_type == SHT_NOTE &&

Link: https://lkml.kernel.org/r/20210525105049.34804-1-cuibixuan@huawei.com
Fixes: 9ee6682aa528 ("module: add printk formats to add module build ID to stacktraces")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomodule-add-printk-formats-to-add-module-build-id-to-stacktraces-fix-fix
Andrew Morton [Wed, 2 Jun 2021 03:53:29 +0000 (13:53 +1000)]
module-add-printk-formats-to-add-module-build-id-to-stacktraces-fix-fix

make kallsyms_lookup_buildid() static

warning: no previous prototype for 'kallsyms_lookup_buildid' [-Wmissing-prototypes]

Cc: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agobuildid: fix build when CONFIG_MODULES is not set
Randy Dunlap [Wed, 2 Jun 2021 03:53:28 +0000 (13:53 +1000)]
buildid: fix build when CONFIG_MODULES is not set

Omit the static_assert() when CONFIG_MODULES is not set/enabled.
Fixes these build errors:

../kernel/kallsyms.c: In function `__sprint_symbol':
../include/linux/kernel.h:53:43: error: dereferencing pointer to incomplete type `struct module'
 #define typeof_member(T, m) typeof(((T*)0)->m)
                                           ^
../include/linux/build_bug.h:78:41: error: static assertion failed: "sizeof(typeof_member(struct module, build_id)) == 20"
 #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
                                         ^
../kernel/kallsyms.c:454:4: note: in expansion of macro `static_assert'
    static_assert(sizeof(typeof_member(struct module, build_id)) == 20);
    ^~~~~~~~~~~~~

Link: https://lkml.kernel.org/r/20210513171510.20328-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomodule-add-printk-formats-to-add-module-build-id-to-stacktraces-fix
Andrew Morton [Wed, 2 Jun 2021 03:53:28 +0000 (13:53 +1000)]
module-add-printk-formats-to-add-module-build-id-to-stacktraces-fix

fix build with CONFIG_MODULES=n, tweak code layout

Cc: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomodule: add printk formats to add module build ID to stacktraces
Stephen Boyd [Wed, 2 Jun 2021 03:53:28 +0000 (13:53 +1000)]
module: add printk formats to add module build ID to stacktraces

Let's make kernel stacktraces easier to identify by including the build
ID[1] of a module if the stacktrace is printing a symbol from a module.
This makes it simpler for developers to locate a kernel module's full
debuginfo for a particular stacktrace.  Combined with
scripts/decode_stracktrace.sh, a developer can download the matching
debuginfo from a debuginfod[2] server and find the exact file and line
number for the functions plus offsets in a stacktrace that match the
module.  This is especially useful for pstore crash debugging where the
kernel crashes are recorded in something like console-ramoops and the
recovery kernel/modules are different or the debuginfo doesn't exist on
the device due to space concerns (the debuginfo can be too large for space
limited devices).

Originally, I put this on the %pS format, but that was quickly rejected
given that %pS is used in other places such as ftrace where build IDs
aren't meaningful.  There was some discussions on the list to put every
module build ID into the "Modules linked in:" section of the stacktrace
message but that quickly becomes very hard to read once you have more than
three or four modules linked in.  It also provides too much information
when we don't expect each module to be traversed in a stacktrace.  Having
the build ID for modules that aren't important just makes things messy.
Splitting it to multiple lines for each module quickly explodes the number
of lines printed in an oops too, possibly wrapping the warning off the
console.  And finally, trying to stash away each module used in a
callstack to provide the ID of each symbol printed is cumbersome and would
require changes to each architecture to stash away modules and return
their build IDs once unwinding has completed.

Instead, we opt for the simpler approach of introducing new printk formats
'%pS[R]b' for "pointer symbolic backtrace with module build ID" and '%pBb'
for "pointer backtrace with module build ID" and then updating the few
places in the architecture layer where the stacktrace is printed to use
this new format.

Before:

 Call trace:
  lkdtm_WARNING+0x28/0x30 [lkdtm]
  direct_entry+0x16c/0x1b4 [lkdtm]
  full_proxy_write+0x74/0xa4
  vfs_write+0xec/0x2e8

After:

 Call trace:
  lkdtm_WARNING+0x28/0x30 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
  direct_entry+0x16c/0x1b4 [lkdtm 6c2215028606bda50de823490723dc4bc5bf46f9]
  full_proxy_write+0x74/0xa4
  vfs_write+0xec/0x2e8

Link: https://lkml.kernel.org/r/20210511003845.2429846-6-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId
Link: https://sourceware.org/elfutils/Debuginfod.html
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agodump_stack: add vmlinux build ID to stack traces
Stephen Boyd [Wed, 2 Jun 2021 03:53:28 +0000 (13:53 +1000)]
dump_stack: add vmlinux build ID to stack traces

Add the running kernel's build ID[1] to the stacktrace information header.
This makes it simpler for developers to locate the vmlinux with full
debuginfo for a particular kernel stacktrace.  Combined with
scripts/decode_stracktrace.sh, a developer can download the correct
vmlinux from a debuginfod[2] server and find the exact file and line
number for the functions plus offsets in a stacktrace.

This is especially useful for pstore crash debugging where the kernel
crashes are recorded in the pstore logs and the recovery kernel is
different or the debuginfo doesn't exist on the device due to space
concerns (the data can be large and a security concern).  The stacktrace
can be analyzed after the crash by using the build ID to find the matching
vmlinux and understand where in the function something went wrong.

Example stacktrace from lkdtm:

 WARNING: CPU: 4 PID: 3255 at drivers/misc/lkdtm/bugs.c:83 lkdtm_WARNING+0x28/0x30 [lkdtm]
 Modules linked in: lkdtm rfcomm algif_hash algif_skcipher af_alg xt_cgroup uinput xt_MASQUERADE
 CPU: 4 PID: 3255 Comm: bash Not tainted 5.11 #3 aa23f7a1231c229de205662d5a9e0d4c580f19a1
 Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
 pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--)
 pc : lkdtm_WARNING+0x28/0x30 [lkdtm]

The hex string aa23f7a1231c229de205662d5a9e0d4c580f19a1 is the build ID,
following the kernel version number. Put it all behind a config option,
STACKTRACE_BUILD_ID, so that kernel developers can remove this
information if they decide it is too much.

Link: https://lkml.kernel.org/r/20210511003845.2429846-5-swboyd@chromium.org
Link: https://fedoraproject.org/wiki/Releases/FeatureBuildId
Link: https://sourceware.org/elfutils/Debuginfod.html
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agobuildid-stash-away-kernels-build-id-on-init-fix
Stephen Boyd [Wed, 2 Jun 2021 03:53:28 +0000 (13:53 +1000)]
buildid-stash-away-kernels-build-id-on-init-fix

fix implicit declaration of function 'init_vmlinux_build_id'

Link: https://lkml.kernel.org/r/CAE-0n51UjTbay8N9FXAyE7_aR2+ePrQnKSRJ0gbmRsXtcLBVaw@mail.gmail.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agobuildid: stash away kernels build ID on init
Stephen Boyd [Wed, 2 Jun 2021 03:53:27 +0000 (13:53 +1000)]
buildid: stash away kernels build ID on init

Parse the kernel's build ID at initialization so that other code can print
a hex format string representation of the running kernel's build ID.  This
will be used in the kdump and dump_stack code so that developers can
easily locate the vmlinux debug symbols for a crash/stacktrace.

Link: https://lkml.kernel.org/r/20210511003845.2429846-4-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agobuildid: add API to parse build ID out of buffer
Stephen Boyd [Wed, 2 Jun 2021 03:53:27 +0000 (13:53 +1000)]
buildid: add API to parse build ID out of buffer

Add an API that can parse the build ID out of a buffer, instead of a vma,
to support printing a kernel module's build ID for stack traces.

Link: https://lkml.kernel.org/r/20210511003845.2429846-3-swboyd@chromium.org
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agobuildid: only consider GNU notes for build ID parsing
Stephen Boyd [Wed, 2 Jun 2021 03:53:27 +0000 (13:53 +1000)]
buildid: only consider GNU notes for build ID parsing

Patch series "Add build ID to stacktraces", v6.

This series adds the kernel's build ID[1] to the stacktrace header printed
in oops messages, warnings, etc.  and the build ID for any module that
appears in the stacktrace after the module name.  The goal is to make the
stacktrace more self-contained and descriptive by including the relevant
build IDs in the kernel logs when something goes wrong.  This can be used
by post processing tools like script/decode_stacktrace.sh and kernel
developers to easily locate the debug info associated with a kernel crash
and line up what line and file things started falling apart at.

To show how this can be used I've included a patch to decode_stacktrace.sh
that downloads the debuginfo from a debuginfod server.  This also includes
some patches to make the buildid.c file use more const arguments and
consolidate logic into buildid.c from kdump.  These are left to the end as
they were mostly cleanup patches.

Here's an example lkdtm stacktrace on arm64.

 WARNING: CPU: 4 PID: 3255 at drivers/misc/lkdtm/bugs.c:83 lkdtm_WARNING+0x28/0x30 [lkdtm]
 Modules linked in: lkdtm rfcomm algif_hash algif_skcipher af_alg xt_cgroup uinput xt_MASQUERADE
 CPU: 4 PID: 3255 Comm: bash Not tainted 5.11 #3 aa23f7a1231c229de205662d5a9e0d4c580f19a1
 Hardware name: Google Lazor (rev3+) with KB Backlight (DT)
 pstate: 00400009 (nzcv daif +PAN -UAO -TCO BTYPE=--)
 pc : lkdtm_WARNING+0x28/0x30 [lkdtm]
 lr : lkdtm_do_action+0x24/0x40 [lkdtm]
 sp : ffffffc0134fbca0
 x29: ffffffc0134fbca0 x28: ffffff92d53ba240
 x27: 0000000000000000 x26: 0000000000000000
 x25: 0000000000000000 x24: ffffffe3622352c0
 x23: 0000000000000020 x22: ffffffe362233366
 x21: ffffffe3622352e0 x20: ffffffc0134fbde0
 x19: 0000000000000008 x18: 0000000000000000
 x17: ffffff929b6536fc x16: 0000000000000000
 x15: 0000000000000000 x14: 0000000000000012
 x13: ffffffe380ed892c x12: ffffffe381d05068
 x11: 0000000000000000 x10: 0000000000000000
 x9 : 0000000000000001 x8 : ffffffe362237000
 x7 : aaaaaaaaaaaaaaaa x6 : 0000000000000000
 x5 : 0000000000000000 x4 : 0000000000000001
 x3 : 0000000000000008 x2 : ffffff93fef25a70
 x1 : ffffff93fef15788 x0 : ffffffe3622352e0
 Call trace:
  lkdtm_WARNING+0x28/0x30 [lkdtm ed5019fdf5e53be37cb1ba7899292d7e143b259e]
  direct_entry+0x16c/0x1b4 [lkdtm ed5019fdf5e53be37cb1ba7899292d7e143b259e]
  full_proxy_write+0x74/0xa4
  vfs_write+0xec/0x2e8
  ksys_write+0x84/0xf0
  __arm64_sys_write+0x24/0x30
  el0_svc_common+0xf4/0x1c0
  do_el0_svc_compat+0x28/0x3c
  el0_svc_compat+0x10/0x1c
  el0_sync_compat_handler+0xa8/0xcc
  el0_sync_compat+0x178/0x180
 ---[ end trace 3d95032303e59e68 ]---

This patch (of 13):

Some kernel elf files have various notes that also happen to have an elf
note type of '3', which matches NT_GNU_BUILD_ID but the note name isn't
"GNU".  For example, this note trips up the existing logic:

 Owner  Data size   Description
 Xen    0x00000008  Unknown note type: (0x00000003) description data: 00 00 00 ffffff80 ffffffff ffffffff ffffffff ffffffff

Let's make sure that it is a GNU note when parsing the build ID so that we
can use this function to parse a vmlinux's build ID too.

Link: https://lkml.kernel.org/r/20210511003845.2429846-1-swboyd@chromium.org
Link: https://lkml.kernel.org/r/20210511003845.2429846-2-swboyd@chromium.org
Fixes: bd7525dacd7e ("bpf: Move stack_map_get_build_id into lib")
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Reported-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Evan Green <evgreen@chromium.org>
Cc: Hsin-Yi Wang <hsinyi@chromium.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomm: fix spelling mistakes in header files
Zhen Lei [Wed, 2 Jun 2021 03:53:27 +0000 (13:53 +1000)]
mm: fix spelling mistakes in header files

Fix some spelling mistakes in comments:
successfull ==> successful
potentialy ==> potentially
alloced ==> allocated
indicies ==> indices
wont ==> won't
resposible ==> responsible
dirtyness ==> dirtiness
droppped ==> dropped
alread ==> already
occured ==> occurred
interupts ==> interrupts
extention ==> extension
slighly ==> slightly
Dont't ==> Don't

Link: https://lkml.kernel.org/r/20210531034849.9549-2-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agosecretmem: test: add basic selftest for memfd_secret(2)
Mike Rapoport [Wed, 2 Jun 2021 03:53:27 +0000 (13:53 +1000)]
secretmem: test: add basic selftest for memfd_secret(2)

The test verifies that file descriptor created with memfd_secret does not
allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK and that remote accesses with process_vm_read() and
ptrace() to the secret memory fail.

Link: https://lkml.kernel.org/r/20210518072034.31572-8-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoarch, mm: wire up memfd_secret system call where relevant
Mike Rapoport [Wed, 2 Jun 2021 03:53:26 +0000 (13:53 +1000)]
arch, mm: wire up memfd_secret system call where relevant

Wire up memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.

Link: https://lkml.kernel.org/r/20210518072034.31572-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoPM: hibernate: disable when there are active secretmem users
Mike Rapoport [Wed, 2 Jun 2021 03:53:26 +0000 (13:53 +1000)]
PM: hibernate: disable when there are active secretmem users

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this essentially
will defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.

Link: https://lkml.kernel.org/r/20210518072034.31572-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomm-introduce-memfd_secret-system-call-to-create-secret-memory-areas-fix
Andrew Morton [Wed, 2 Jun 2021 03:53:26 +0000 (13:53 +1000)]
mm-introduce-memfd_secret-system-call-to-create-secret-memory-areas-fix

suppress Kconfig whine

Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomm: introduce memfd_secret system call to create "secret" memory areas
Mike Rapoport [Wed, 2 Jun 2021 03:53:26 +0000 (13:53 +1000)]
mm: introduce memfd_secret system call to create "secret" memory areas

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.

The secretmem feature is off by default and the user must explicitly
enable it at the boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call.  The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the processes
that have access to the file descriptor.

Secretmem is designed to provide the following protections:

* Enhanced protection (in conjunction with all the other in-kernel
  attack prevention systems) against ROP attacks.  Seceretmem makes
  "simple" ROP insufficient to perform exfiltration, which increases the
  required complexity of the attack.  Along with other protections like
  the kernel stack size limit and address space layout randomization which
  make finding gadgets is really hard, absence of any in-kernel primitive
  for accessing secret memory means the one gadget ROP attack can't work.
  Since the only way to access secret memory is to reconstruct the missing
  mapping entry, the attacker has to recover the physical page and insert
  a PTE pointing to it in the kernel and then retrieve the contents.  That
  takes at least three gadgets which is a level of difficulty beyond most
  standard attacks.

* Prevent cross-process secret userspace memory exposures.  Once the
  secret memory is allocated, the user can't accidentally pass it into the
  kernel to be transmitted somewhere.  The secreremem pages cannot be
  accessed via the direct map and they are disallowed in GUP.

* Harden against exploited kernel flaws.  In order to access secretmem,
  a kernel-side attack would need to either walk the page tables and
  create new ones, or spawn a new privileged uiserspace process to perform
  secrets exfiltration using ptrace.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise().  File
descriptor approach allows explicit and controlled sharing of the memory
areas, it allows to seal the operations.  Besides, file descriptor based
memory paves the way for VMMs to remove the secret memory range from the
userspace hipervisor process, for instance QEMU.  Andy Lutomirski says:

  "Getting fd-backed memory into a guest will take some possibly major
  work in the kernel, but getting vma-backed memory into a guest without
  mapping it in the host user address space seems much, much worse."

memfd_secret() is made a dedicated system call rather than an extension to
memfd_create() because it's purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to
the memory.  Nowadays a new system call cost is negligible while it is way
simpler for userspace to deal with a clear-cut system calls than with a
multiplexer or an overloaded syscall.  Moreover, the initial
implementation of memfd_secret() is completely distinct from
memfd_create() so there is no much sense in overloading memfd_create() to
begin with.  If there will be a need for code sharing between these
implementation it can be easily achieved without a need to adjust user
visible APIs.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there will be a use case that will require exposing secretmem to the
kernel it will be an opt-in request in the system call flags so that user
would have to decide what data can be exposed to the kernel.

Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which
affects the system performance.  However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice".  Hence, it is sufficient to
have secretmem disabled by default with the ability of a system
administrator to enable it at boot time.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

Since the secretmem mappings are locked in memory they cannot exceed
RLIMIT_MEMLOCK.  Since these mappings are already locked independently
from mlock(), an attempt to mlock()/munlock() secretmem range would fail
and mlockall()/munlockall() will ignore secretmem mappings.

However, unlike mlock()ed memory, secretmem currently behaves more like
long-term GUP: secretmem mappings are unmovable mappings directly consumed
by user space.  With default limits, there is no excessive use of
secretmem and it poses no real problem in combination with
ZONE_MOVABLE/CMA, but in the future this should be addressed to allow
balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

fd = memfd_secret(0);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
   MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoset_memory: allow querying whether set_direct_map_*() is actually enabled
Mike Rapoport [Wed, 2 Jun 2021 03:53:26 +0000 (13:53 +1000)]
set_memory: allow querying whether set_direct_map_*() is actually enabled

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend set_memory API with can_set_direct_map() function that allows
checking if calling set_direct_map_*() will actually change the page
table, replace several occurrences of open coded checks in arm64 with the
new function and provide a generic stub for architectures that always
modify page tables upon calls to set_direct_map APIs.

[arnd@arndb.de: arm64: kfence: fix header inclusion ]
Link: https://lkml.kernel.org/r/20210518072034.31572-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoriscv/Kconfig: make direct map manipulation options depend on MMU
Mike Rapoport [Wed, 2 Jun 2021 03:53:25 +0000 (13:53 +1000)]
riscv/Kconfig: make direct map manipulation options depend on MMU

ARCH_HAS_SET_DIRECT_MAP and ARCH_HAS_SET_MEMORY configuration options have
no meaning when CONFIG_MMU is disabled and there is no point to enable
them for the nommu case.

Add an explicit dependency on MMU for these options.

Link: https://lkml.kernel.org/r/20210518072034.31572-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agommap: make mlock_future_check() global
Mike Rapoport [Wed, 2 Jun 2021 03:53:25 +0000 (13:53 +1000)]
mmap: make mlock_future_check() global

Patch series "mm: introduce memfd_secret system call to create "secret" memory areas", v20.

This is an implementation of "secret" mappings backed by a file
descriptor.

The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call The desired protection mode for the
memory is configured using flags parameter of the system call.  The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping.  The pages in that mapping will be marked as not present
in the direct map and will be present only in the page table of the owning
mm.

Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants
mappings.

It's designed to provide the following protections:

* Enhanced protection (in conjunction with all the other in-kernel
  attack prevention systems) against ROP attacks.  Seceretmem makes
  "simple" ROP insufficient to perform exfiltration, which increases the
  required complexity of the attack.  Along with other protections like
  the kernel stack size limit and address space layout randomization which
  make finding gadgets is really hard, absence of any in-kernel primitive
  for accessing secret memory means the one gadget ROP attack can't work.
  Since the only way to access secret memory is to reconstruct the missing
  mapping entry, the attacker has to recover the physical page and insert
  a PTE pointing to it in the kernel and then retrieve the contents.  That
  takes at least three gadgets which is a level of difficulty beyond most
  standard attacks.

* Prevent cross-process secret userspace memory exposures.  Once the
  secret memory is allocated, the user can't accidentally pass it into the
  kernel to be transmitted somewhere.  The secreremem pages cannot be
  accessed via the direct map and they are disallowed in GUP.

* Harden against exploited kernel flaws.  In order to access secretmem,
  a kernel-side attack would need to either walk the page tables and
  create new ones, or spawn a new privileged uiserspace process to perform
  secrets exfiltration using ptrace.

In the future the secret mappings may be used as a mean to protect guest
memory in a virtual machine host.

For demonstration of secret memory usage we've created a userspace library

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git

that does two things: the first is act as a preloader for openssl to
redirect all the OPENSSL_malloc calls to secret memory meaning any secret
keys get automatically protected this way and the other thing it does is
expose the API to the user who needs it.  We anticipate that a lot of the
use cases would be like the openssl one: many toolkits that deal with
secret keys already have special handling for the memory to try to give
them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.

Hiding secret memory mappings behind an anonymous file allows usage of the
page cache for tracking pages allocated for the "secret" mappings as well
as using address_space_operations for e.g.  page migration callbacks.

The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native"
mm ABIs in the future.

Removing of the pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory which
affects the system performance.  However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice".  Hence, it is sufficient to
have secretmem disabled by default with the ability of a system
administrator to enable it at boot time.

In addition, there is also a long term goal to improve management of the
direct map.

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

This patch (of 7):

It will be used by the upcoming secret memory implementation.

Link: https://lkml.kernel.org/r/20210518072034.31572-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210518072034.31572-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomm/slub: use stackdepot to save stack trace in objects-fix
Vlastimil Babka [Wed, 2 Jun 2021 03:53:25 +0000 (13:53 +1000)]
mm/slub: use stackdepot to save stack trace in objects-fix

Paul reports [1] lockdep splat HARDIRQ-safe -> HARDIRQ-unsafe lock order
detected.  Kernel test robot reports [2]
BUG:sleeping_function_called_from_invalid_context_at_mm/page_alloc.c

The stack trace might be saved from contexts where we can't block so GFP_KERNEL
is unsafe. So use GFP_NOWAIT. Under memory pressure we might thus fail to save
some new unique stack, but that should be extremely rare.

[1] https://lore.kernel.org/linux-mm/20210515204622.GA2672367@paulmck-ThinkPad-P17-Gen-1/
[2] https://lore.kernel.org/linux-mm/20210516144152.GA25903@xsang-OptiPlex-9020/

Link: https://lkml.kernel.org/r/20210516195150.26740-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoslub: STACKDEPOT: rename save_stack_trace()
Randy Dunlap [Wed, 2 Jun 2021 03:53:25 +0000 (13:53 +1000)]
slub: STACKDEPOT: rename save_stack_trace()

save_stack_trace() already exists, so change the one in
CONFIG_STACKDEPOT to be save_stack_depot_trace().

Fixes this build error:

../mm/slub.c:607:29: error: conflicting types for `save_stack_trace'
 static depot_stack_handle_t save_stack_trace(gfp_t flags)
                             ^~~~~~~~~~~~~~~~
In file included from ../include/linux/page_ext.h:6:0,
                 from ../include/linux/mm.h:25,
                 from ../mm/slub.c:13:
../include/linux/stacktrace.h:86:13: note: previous declaration of `save_stack_trace' was here
 extern void save_stack_trace(struct stack_trace *trace);
             ^~~~~~~~~~~~~~~~

from this patch in mmotm:
  Subject: mm/slub: use stackdepot to save stack trace in objects

Link: https://lkml.kernel.org/r/20210513051920.29320-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Oliver Glitta <glittao@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomm/slub: use stackdepot to save stack trace in objects
Oliver Glitta [Wed, 2 Jun 2021 03:53:25 +0000 (13:53 +1000)]
mm/slub: use stackdepot to save stack trace in objects

Many stack traces are similar so there are many similar arrays.
Stackdepot saves each unique stack only once.

Replace field addrs in struct track with depot_stack_handle_t handle.  Use
stackdepot to save stack trace.

The benefits are smaller memory overhead and possibility to aggregate
per-cache statistics in the future using the stackdepot handle instead of
matching stacks manually.

Link: https://lkml.kernel.org/r/20210414163434.4376-1-glittao@gmail.com
Signed-off-by: Oliver Glitta <glittao@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoMerge branch 'akpm-current/current'
Stephen Rothwell [Thu, 3 Jun 2021 08:01:30 +0000 (18:01 +1000)]
Merge branch 'akpm-current/current'

3 years agoMerge remote-tracking branch 'tpmdd-jejb/tpmdd-for-next'
Stephen Rothwell [Thu, 3 Jun 2021 07:37:35 +0000 (17:37 +1000)]
Merge remote-tracking branch 'tpmdd-jejb/tpmdd-for-next'

3 years agoMerge remote-tracking branch 'cxl/next'
Stephen Rothwell [Thu, 3 Jun 2021 07:35:53 +0000 (17:35 +1000)]
Merge remote-tracking branch 'cxl/next'

3 years agoMerge remote-tracking branch 'rust/rust-next'
Stephen Rothwell [Thu, 3 Jun 2021 07:20:52 +0000 (17:20 +1000)]
Merge remote-tracking branch 'rust/rust-next'

# Conflicts:
# Makefile
# include/uapi/linux/android/binder.h
# kernel/printk/printk.c

3 years agoMerge remote-tracking branch 'memblock/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 07:18:42 +0000 (17:18 +1000)]
Merge remote-tracking branch 'memblock/for-next'

3 years agoMerge remote-tracking branch 'mhi/mhi-next'
Stephen Rothwell [Thu, 3 Jun 2021 07:17:03 +0000 (17:17 +1000)]
Merge remote-tracking branch 'mhi/mhi-next'

3 years agoMerge remote-tracking branch 'fpga/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 07:15:20 +0000 (17:15 +1000)]
Merge remote-tracking branch 'fpga/for-next'

3 years agoMerge remote-tracking branch 'auxdisplay/auxdisplay'
Stephen Rothwell [Thu, 3 Jun 2021 07:14:00 +0000 (17:14 +1000)]
Merge remote-tracking branch 'auxdisplay/auxdisplay'

3 years agoMerge remote-tracking branch 'hyperv/hyperv-next'
Stephen Rothwell [Thu, 3 Jun 2021 07:11:46 +0000 (17:11 +1000)]
Merge remote-tracking branch 'hyperv/hyperv-next'

3 years agoMerge remote-tracking branch 'nvmem/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 07:09:40 +0000 (17:09 +1000)]
Merge remote-tracking branch 'nvmem/for-next'

3 years agoMerge remote-tracking branch 'slimbus/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 07:08:01 +0000 (17:08 +1000)]
Merge remote-tracking branch 'slimbus/for-next'

# Conflicts:
# drivers/nvmem/Kconfig
# drivers/nvmem/Makefile

3 years agoMerge remote-tracking branch 'gnss/gnss-next'
Stephen Rothwell [Thu, 3 Jun 2021 07:06:18 +0000 (17:06 +1000)]
Merge remote-tracking branch 'gnss/gnss-next'

3 years agoMerge remote-tracking branch 'kspp/for-next/kspp'
Stephen Rothwell [Thu, 3 Jun 2021 06:51:13 +0000 (16:51 +1000)]
Merge remote-tracking branch 'kspp/for-next/kspp'

3 years agoMerge remote-tracking branch 'seccomp/for-next/seccomp'
Stephen Rothwell [Thu, 3 Jun 2021 06:37:40 +0000 (16:37 +1000)]
Merge remote-tracking branch 'seccomp/for-next/seccomp'

# Conflicts:
# kernel/seccomp.c

3 years agoMerge remote-tracking branch 'nvdimm/libnvdimm-for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:37:37 +0000 (16:37 +1000)]
Merge remote-tracking branch 'nvdimm/libnvdimm-for-next'

3 years agoMerge remote-tracking branch 'rtc/rtc-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:35:27 +0000 (16:35 +1000)]
Merge remote-tracking branch 'rtc/rtc-next'

3 years agoMerge remote-tracking branch 'coresight/next'
Stephen Rothwell [Thu, 3 Jun 2021 06:35:23 +0000 (16:35 +1000)]
Merge remote-tracking branch 'coresight/next'

3 years agoMerge remote-tracking branch 'livepatching/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:35:23 +0000 (16:35 +1000)]
Merge remote-tracking branch 'livepatching/for-next'

3 years agoMerge remote-tracking branch 'userns/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:20:30 +0000 (16:20 +1000)]
Merge remote-tracking branch 'userns/for-next'

# Conflicts:
# include/linux/sched/user.h
# include/linux/user_namespace.h
# kernel/signal.c
# kernel/ucount.c

3 years agoMerge remote-tracking branch 'pwm/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:18:23 +0000 (16:18 +1000)]
Merge remote-tracking branch 'pwm/for-next'

3 years agoMerge remote-tracking branch 'pinctrl-renesas/renesas-pinctrl'
Stephen Rothwell [Thu, 3 Jun 2021 06:16:18 +0000 (16:16 +1000)]
Merge remote-tracking branch 'pinctrl-renesas/renesas-pinctrl'

3 years agoMerge remote-tracking branch 'pinctrl-intel/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:14:39 +0000 (16:14 +1000)]
Merge remote-tracking branch 'pinctrl-intel/for-next'

3 years agoMerge remote-tracking branch 'pinctrl/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:12:14 +0000 (16:12 +1000)]
Merge remote-tracking branch 'pinctrl/for-next'

# Conflicts:
# include/linux/pinctrl/pinconf-generic.h

3 years agoMerge remote-tracking branch 'gpio-intel/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:10:36 +0000 (16:10 +1000)]
Merge remote-tracking branch 'gpio-intel/for-next'

3 years agoMerge remote-tracking branch 'gpio-brgl/gpio/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:08:30 +0000 (16:08 +1000)]
Merge remote-tracking branch 'gpio-brgl/gpio/for-next'

3 years agoMerge remote-tracking branch 'rpmsg/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:06:16 +0000 (16:06 +1000)]
Merge remote-tracking branch 'rpmsg/for-next'

3 years agoMerge remote-tracking branch 'vhost/linux-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:04:08 +0000 (16:04 +1000)]
Merge remote-tracking branch 'vhost/linux-next'

3 years agoMerge remote-tracking branch 'scsi-mkp/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 06:01:33 +0000 (16:01 +1000)]
Merge remote-tracking branch 'scsi-mkp/for-next'

3 years agoMerge remote-tracking branch 'scsi/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 05:58:58 +0000 (15:58 +1000)]
Merge remote-tracking branch 'scsi/for-next'

3 years agoMerge remote-tracking branch 'cgroup/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 05:49:06 +0000 (15:49 +1000)]
Merge remote-tracking branch 'cgroup/for-next'

3 years agoMerge remote-tracking branch 'dmaengine/next'
Stephen Rothwell [Thu, 3 Jun 2021 05:46:51 +0000 (15:46 +1000)]
Merge remote-tracking branch 'dmaengine/next'

3 years agoMerge remote-tracking branch 'icc/icc-next'
Stephen Rothwell [Thu, 3 Jun 2021 05:45:04 +0000 (15:45 +1000)]
Merge remote-tracking branch 'icc/icc-next'

3 years agoMerge remote-tracking branch 'iio/togreg'
Stephen Rothwell [Thu, 3 Jun 2021 05:43:06 +0000 (15:43 +1000)]
Merge remote-tracking branch 'iio/togreg'

3 years agoMerge remote-tracking branch 'staging/staging-next'
Stephen Rothwell [Thu, 3 Jun 2021 05:40:58 +0000 (15:40 +1000)]
Merge remote-tracking branch 'staging/staging-next'

3 years agoMerge remote-tracking branch 'thunderbolt/next'
Stephen Rothwell [Thu, 3 Jun 2021 05:39:21 +0000 (15:39 +1000)]
Merge remote-tracking branch 'thunderbolt/next'

3 years agoMerge remote-tracking branch 'soundwire/next'
Stephen Rothwell [Thu, 3 Jun 2021 05:37:38 +0000 (15:37 +1000)]
Merge remote-tracking branch 'soundwire/next'

3 years agoMerge remote-tracking branch 'phy-next/next'
Stephen Rothwell [Thu, 3 Jun 2021 05:35:29 +0000 (15:35 +1000)]
Merge remote-tracking branch 'phy-next/next'

3 years agoMerge remote-tracking branch 'extcon/extcon-next'
Stephen Rothwell [Thu, 3 Jun 2021 05:33:51 +0000 (15:33 +1000)]
Merge remote-tracking branch 'extcon/extcon-next'

3 years agoMerge remote-tracking branch 'char-misc/char-misc-next'
Stephen Rothwell [Thu, 3 Jun 2021 05:19:05 +0000 (15:19 +1000)]
Merge remote-tracking branch 'char-misc/char-misc-next'

3 years agoMerge remote-tracking branch 'tty/tty-next'
Stephen Rothwell [Thu, 3 Jun 2021 05:16:41 +0000 (15:16 +1000)]
Merge remote-tracking branch 'tty/tty-next'

3 years agoMerge remote-tracking branch 'usb-chipidea-next/for-usb-next'
Stephen Rothwell [Thu, 3 Jun 2021 05:15:02 +0000 (15:15 +1000)]
Merge remote-tracking branch 'usb-chipidea-next/for-usb-next'

3 years agoMerge remote-tracking branch 'usb-serial/usb-next'
Stephen Rothwell [Thu, 3 Jun 2021 05:13:18 +0000 (15:13 +1000)]
Merge remote-tracking branch 'usb-serial/usb-next'

3 years agoMerge remote-tracking branch 'usb/usb-next'
Stephen Rothwell [Thu, 3 Jun 2021 04:58:58 +0000 (14:58 +1000)]
Merge remote-tracking branch 'usb/usb-next'

3 years agoMerge remote-tracking branch 'driver-core/driver-core-next'
Stephen Rothwell [Thu, 3 Jun 2021 04:43:55 +0000 (14:43 +1000)]
Merge remote-tracking branch 'driver-core/driver-core-next'

3 years agoMerge remote-tracking branch 'ipmi/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 04:42:17 +0000 (14:42 +1000)]
Merge remote-tracking branch 'ipmi/for-next'

3 years agoMerge remote-tracking branch 'leds/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 04:40:07 +0000 (14:40 +1000)]
Merge remote-tracking branch 'leds/for-next'

3 years agoMerge remote-tracking branch 'drivers-x86/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 04:38:25 +0000 (14:38 +1000)]
Merge remote-tracking branch 'drivers-x86/for-next'

# Conflicts:
# drivers/platform/surface/surface_aggregator_registry.c

3 years agoMerge remote-tracking branch 'percpu/for-next'
Stephen Rothwell [Thu, 3 Jun 2021 04:36:18 +0000 (14:36 +1000)]
Merge remote-tracking branch 'percpu/for-next'