3 years agomm: Move mas locking outside of munmap() path.
Liam R. Howlett [Thu, 11 Mar 2021 20:25:33 +0000 (15:25 -0500)]
mm: Move mas locking outside of munmap() path.

Now that there is a split variant that allows splitting to use a maple state,
move the locks to a more logical position.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Add mas_split_vma() and use it for munmap()
Liam R. Howlett [Tue, 9 Mar 2021 03:47:27 +0000 (22:47 -0500)]
mm/mmap: Add mas_split_vma() and use it for munmap()

Use the maple state when splitting a VMA to avoid rewalking/resetting the
state on splits.  This is also needed to clean up the locking.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Return a bool from anon_vma_interval_tree_verify()
Liam R. Howlett [Sun, 7 Feb 2021 00:47:59 +0000 (19:47 -0500)]
mm: Return a bool from anon_vma_interval_tree_verify()

Added to allow printing which VMA has the issue.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Remove vma linked list.
Liam R. Howlett [Mon, 4 Jan 2021 20:15:50 +0000 (15:15 -0500)]
mm: Remove vma linked list.

The vma linked list has been replaced by the maple tree iterators and
vma_next() vma_prev() functions.

Part of this change also converts free_pgd_range(), zap_page_range(),
and unmap_single_vma() to the iterators.

The internal _vma_next() has been dropped.
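
A minimal sketch of the iteration pattern that replaces the linked-list
walk (the tree name and macros follow the maple tree API; the exact
conversion varies per call site):

  struct vm_area_struct *vma;
  MA_STATE(mas, &mm->mm_mt, 0, 0);

  /* Replaces: for (vma = mm->mmap; vma; vma = vma->vm_next) */
  mas_for_each(&mas, vma, ULONG_MAX) {
          /* ... operate on vma ... */
  }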

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agobpf: Remove VMA linked list
Liam R. Howlett [Wed, 7 Apr 2021 19:56:08 +0000 (15:56 -0400)]
bpf: Remove VMA linked list

Use vma_next() and remove the reference to the start of the linked list.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/um/kernel/tlb: Stop using linked list
Liam R. Howlett [Tue, 16 Mar 2021 19:56:41 +0000 (15:56 -0400)]
arch/um/kernel/tlb: Stop using linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/util: Remove __vma_link_list() and __vma_unlink_list()
Liam R. Howlett [Mon, 4 Jan 2021 20:10:54 +0000 (15:10 -0500)]
mm/util: Remove __vma_link_list() and __vma_unlink_list()

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/swapfile: Use maple tree iterator instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 20:07:11 +0000 (15:07 -0500)]
mm/swapfile: Use maple tree iterator instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/pagewalk: Use vma_next() instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 20:06:47 +0000 (15:06 -0500)]
mm/pagewalk: Use vma_next() instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/oom_kill: Use maple tree iterators instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 20:06:25 +0000 (15:06 -0500)]
mm/oom_kill: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/msync: Use vma_next() instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 20:04:14 +0000 (15:04 -0500)]
mm/msync: Use vma_next() instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mremap: Use vma_next() instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 20:03:49 +0000 (15:03 -0500)]
mm/mremap: Use vma_next() instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mprotect: Use maple tree navigation instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 20:02:28 +0000 (15:02 -0500)]
mm/mprotect: Use maple tree navigation instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mlock: Use maple tree iterators instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 20:00:14 +0000 (15:00 -0500)]
mm/mlock: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mempolicy: Use maple tree iterators instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:59:52 +0000 (14:59 -0500)]
mm/mempolicy: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/memcontrol: Stop using mm->highest_vm_end
Liam R. Howlett [Mon, 4 Jan 2021 19:59:05 +0000 (14:59 -0500)]
mm/memcontrol: Stop using mm->highest_vm_end

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/madvise: Use vma_next instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:58:35 +0000 (14:58 -0500)]
mm/madvise: Use vma_next instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/ksm: Use maple tree iterators instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:58:11 +0000 (14:58 -0500)]
mm/ksm: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/khugepaged: Use maple tree iterators instead of vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:56:58 +0000 (14:56 -0500)]
mm/khugepaged: Use maple tree iterators instead of vma linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/gup: Use maple tree navigation instead of linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:56:20 +0000 (14:56 -0500)]
mm/gup: Use maple tree navigation instead of linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agokernel/sys: Use maple tree iterators instead of linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:54:49 +0000 (14:54 -0500)]
kernel/sys: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agokernel/sched/fair: Use maple tree iterators instead of linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:54:07 +0000 (14:54 -0500)]
kernel/sched/fair: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agokernel/events/uprobes: Use maple tree iterators instead of linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:53:01 +0000 (14:53 -0500)]
kernel/events/uprobes: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agokernel/events/core: Use maple tree iterators instead of linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:52:39 +0000 (14:52 -0500)]
kernel/events/core: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agokernel/acct: Use maple tree iterators instead of linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:52:15 +0000 (14:52 -0500)]
kernel/acct: Use maple tree iterators instead of linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoipc/shm: Stop using the vma linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:51:19 +0000 (14:51 -0500)]
ipc/shm: Stop using the vma linked list

When searching for a VMA, a maple state can be used instead of the
linked list in the mm_struct.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agofs/userfaultfd: Stop using vma linked list.
Liam R. Howlett [Mon, 4 Jan 2021 19:49:47 +0000 (14:49 -0500)]
fs/userfaultfd: Stop using vma linked list.

Don't use the mm_struct linked list or vma->vm_next, in preparation for their removal.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agofs/proc/task_mmu: Stop using linked list and highest_vm_end
Liam R. Howlett [Mon, 4 Jan 2021 19:47:48 +0000 (14:47 -0500)]
fs/proc/task_mmu: Stop using linked list and highest_vm_end

Remove references to the mm_struct linked list and highest_vm_end in preparation for their removal.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agofs/proc/base: Use maple tree iterators in place of linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:46:23 +0000 (14:46 -0500)]
fs/proc/base: Use maple tree iterators in place of linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agofs/exec: Use vma_next() instead of linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:45:37 +0000 (14:45 -0500)]
fs/exec: Use vma_next() instead of linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agofs/coredump: Use maple tree iterators in place of linked list
Liam R. Howlett [Mon, 4 Jan 2021 19:44:45 +0000 (14:44 -0500)]
fs/coredump: Use maple tree iterators in place of linked list

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agofs/binfmt_elf: Use maple tree iterators for fill_files_note()
Liam R. Howlett [Mon, 4 Jan 2021 19:44:16 +0000 (14:44 -0500)]
fs/binfmt_elf: Use maple tree iterators for fill_files_note()

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agodrivers/tee/optee: Use maple tree iterators for __check_mem_type()
Liam R. Howlett [Mon, 4 Jan 2021 19:43:36 +0000 (14:43 -0500)]
drivers/tee/optee: Use maple tree iterators for __check_mem_type()

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agodrivers/misc/cxl: Use maple tree iterators for cxl_prefault_vma()
Liam R. Howlett [Mon, 4 Jan 2021 19:31:50 +0000 (14:31 -0500)]
drivers/misc/cxl: Use maple tree iterators for cxl_prefault_vma()

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/xtensa: Use maple tree iterators for unmapped area
Liam R. Howlett [Mon, 4 Jan 2021 19:30:59 +0000 (14:30 -0500)]
arch/xtensa: Use maple tree iterators for unmapped area

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/x86: Use maple tree iterators for vdso/vma
Liam R. Howlett [Mon, 4 Jan 2021 19:30:25 +0000 (14:30 -0500)]
arch/x86: Use maple tree iterators for vdso/vma

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/s390: Use maple tree iterators instead of linked list.
Liam R. Howlett [Mon, 4 Jan 2021 19:29:19 +0000 (14:29 -0500)]
arch/s390: Use maple tree iterators instead of linked list.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/powerpc: Remove mmap linked list from mm/book3s64/subpage_prot
Liam R. Howlett [Mon, 4 Jan 2021 19:26:34 +0000 (14:26 -0500)]
arch/powerpc: Remove mmap linked list from mm/book3s64/subpage_prot

Start using the maple tree

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/powerpc: Remove mmap linked list from mm/book3s32/tlb
Liam R. Howlett [Mon, 4 Jan 2021 19:25:54 +0000 (14:25 -0500)]
arch/powerpc: Remove mmap linked list from mm/book3s32/tlb

Start using the maple tree

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/parisc: Remove mmap linked list from kernel/cache
Liam R. Howlett [Mon, 4 Jan 2021 19:25:19 +0000 (14:25 -0500)]
arch/parisc: Remove mmap linked list from kernel/cache

Start using the maple tree

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/arm64: Remove mmap linked list from vdso.
Liam R. Howlett [Mon, 4 Jan 2021 19:24:40 +0000 (14:24 -0500)]
arch/arm64: Remove mmap linked list from vdso.

Start using the maple tree

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Introduce vma_next() and vma_prev()
Liam R. Howlett [Mon, 4 Jan 2021 20:13:14 +0000 (15:13 -0500)]
mm: Introduce vma_next() and vma_prev()

Rename internal vma_next() to _vma_next().

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
Liam R. Howlett [Tue, 1 Dec 2020 02:30:04 +0000 (21:30 -0500)]
mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()

do_brk_munmap() has already aligned the address and has a maple tree
state to be used.  Use the new do_mas_align_munmap() to avoid
unnecessary alignment and error checks.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Reorganize munmap to use maple states
Liam R. Howlett [Thu, 19 Nov 2020 17:57:23 +0000 (12:57 -0500)]
mm/mmap: Reorganize munmap to use maple states

Remove __do_munmap() in favour of do_munmap(), do_mas_munmap(), and
do_mas_align_munmap().

do_munmap() is a wrapper to create a maple state for any callers that
have not been converted to the maple tree.

do_mas_munmap() takes a maple state to munmap a range.  This is just a
small function which checks for error conditions and aligns the end of
the range.

do_mas_align_munmap() uses the aligned range to munmap a range.  It
starts with the first VMA in the range, then finds the last VMA in the
range.  Both start and end are split if necessary.  Then the VMAs are
unlocked and removed from the linked list at the same time, followed by
a single tree operation that overwrites the area with NULL.  Finally,
the detached list is unmapped and freed.

By reorganizing the munmap calls as outlined, it is now possible to
avoid the extra work of aligning pre-aligned callers which are known to
be safe, and to avoid extra VMA lookups or tree walks for modifications.

detach_vmas_to_be_unmapped() is no longer used, so drop this code.
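
A minimal sketch of the wrapper, assuming do_mas_munmap() takes the
maple state, the range, the userfaultfd list, and a lock-downgrade flag:

  int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
                struct list_head *uf)
  {
          /* Build a maple state for callers not yet converted. */
          MA_STATE(mas, &mm->mm_mt, start, start);

          return do_mas_munmap(&mas, mm, start, len, uf, false);
  }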

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Move mmap_region() below do_munmap()
Liam R. Howlett [Wed, 27 Jan 2021 15:25:13 +0000 (10:25 -0500)]
mm/mmap: Move mmap_region() below do_munmap()

Relocation of code for the next commit.  There should be no changes
here.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Remove vmacache
Liam R. Howlett [Mon, 16 Nov 2020 19:50:20 +0000 (14:50 -0500)]
mm: Remove vmacache

By using the maple tree and the maple tree state, the vmacache is no
longer beneficial and is complicating the VMA code.  Remove the vmacache
to reduce the work in keeping it up to date and code complexity.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Use advanced maple tree API for mmap_region()
Liam R. Howlett [Tue, 10 Nov 2020 18:37:40 +0000 (13:37 -0500)]
mm/mmap: Use advanced maple tree API for mmap_region()

Changing mmap_region() to use the maple tree state and the advanced
maple tree interface allows for a lot less tree walking.

This change removes the last caller of munmap_vma_range(), so drop this
unused function.

Add vma_expand() to expand a VMA if possible by doing the necessary
hugepage check, uprobe_munmap() of files, dcache flush, the
modifications, and then undoing the detaches, etc.

Add vma_mas_link() helper to add a VMA to the linked list and maple tree
until the linked list is removed.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Use maple tree operations for find_vma_intersection() and find_vma()
Liam R. Howlett [Mon, 28 Sep 2020 19:50:19 +0000 (15:50 -0400)]
mm: Use maple tree operations for find_vma_intersection() and find_vma()

Move find_vma_intersection() to mmap.c and change implementation to
maple tree.

When searching for a vma within a range, it is easier to use the maple
tree interface.  This means the find_vma() call changes to a special
case of the find_vma_intersection().
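
A sketch of the relationship (the exact implementation may differ):

  struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
  {
          /* An intersection search that is unbounded above: any VMA
           * ending after addr is a match. */
          return find_vma_intersection(mm, addr, ULONG_MAX);
  }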

Exported for kvm module.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Change do_brk_flags() to expand existing VMA and add
Liam R. Howlett [Mon, 21 Sep 2020 14:47:34 +0000 (10:47 -0400)]
mm/mmap: Change do_brk_flags() to expand existing VMA and add do_brk_munmap()

Avoid allocating a new VMA when a VMA modification can occur.  When a
brk() can expand or contract a VMA, then the single store operation will
only modify one index of the maple tree instead of causing a node to
split or coalesce.  This avoids unnecessary allocations/frees of maple
tree nodes and VMAs.

Use the advanced API for the maple tree to avoid unnecessary walks of
the tree.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/khugepaged: Optimize collapse_pte_mapped_thp() by using vma_lookup()
Liam R. Howlett [Thu, 8 Apr 2021 20:21:58 +0000 (16:21 -0400)]
mm/khugepaged: Optimize collapse_pte_mapped_thp() by using vma_lookup()

vma_lookup() will walk the vma tree once and not continue to look for
the next vma.  Since the exact vma is checked below, this is a more
efficient way of searching.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Optimize find_exact_vma() to use vma_lookup()
Liam R. Howlett [Thu, 8 Apr 2021 20:15:00 +0000 (16:15 -0400)]
mm: Optimize find_exact_vma() to use vma_lookup()

Use vma_lookup() to walk the tree to the start value requested.  If
the vma at the start does not match, then the answer is NULL and there
is no need to look at the next vma the way that find_vma() would.
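
A minimal sketch of the converted helper, assuming find_exact_vma()
keeps its existing start/end semantics:

  static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
          unsigned long vm_start, unsigned long vm_end)
  {
          struct vm_area_struct *vma = vma_lookup(mm, vm_start);

          /* vma_lookup() guarantees vm_start is inside the returned
           * vma; only the exact bounds remain to be checked. */
          if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
                  vma = NULL;

          return vma;
  }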

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoxen/privcmd: Optimized privcmd_ioctl_mmap() by using vma_lookup()
Liam R. Howlett [Thu, 8 Apr 2021 20:06:36 +0000 (16:06 -0400)]
xen/privcmd: Optimized privcmd_ioctl_mmap() by using vma_lookup()

vma_lookup() walks the VMA tree for a specific value; find_vma() will
search the tree after walking to a specific value.  It is more efficient
to only walk to the requested value as this case requires the address to
equal the vm_start.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Remove rb tree.
Liam R. Howlett [Fri, 24 Jul 2020 19:06:25 +0000 (15:06 -0400)]
mm: Remove rb tree.

Remove the RB tree and start using the maple tree for vm_area_struct
tracking.

Drop validate_mm() calls in expand_upwards() and expand_downwards() as
the lock is not held.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agokernel/fork: Use maple tree for dup_mmap() during forking
Liam R. Howlett [Fri, 24 Jul 2020 17:32:30 +0000 (13:32 -0400)]
kernel/fork: Use maple tree for dup_mmap() during forking

The maple tree was already tracking VMAs in this function by an earlier
commit, but the rbtree iterator was being used to iterate the list.
Change the iterator to use a maple tree native iterator, rcu locking,
and switch to the maple tree advanced API to avoid multiple walks of the
tree during insert operations.

anon_vma_fork() may enter the slow path, and the resulting schedule()
call would cause RCU issues.  Drop the RCU lock around the call and
reacquire it afterwards.  There is no harm in this approach as the
mmap_sem is taken for write/read and held across the schedule() call,
so the VMAs will not change.

Note that the bulk allocation of nodes is also happening here for
performance reasons.  The node calculations are done internally to the
tree and use the VMA count and assume the worst-case node requirements.
The VM_DONTCOPY flag does not allow for the most efficient copy method
of the tree, so a bulk loading algorithm is used.
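
A rough sketch of the bulk-loading setup; mas_expected_entries() is the
assumed entry point for telling the tree how many entries to expect,
and error handling is omitted:

  MA_STATE(mas, &mm->mm_mt, 0, 0);

  /* Preallocate nodes for the worst case based on the VMA count. */
  mas_expected_entries(&mas, oldmm->map_count);

  /* ... store each duplicated VMA through the maple state ... */

  mas_destroy(&mas);      /* release any unused preallocations */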

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Use maple tree for unmapped_area{_topdown}
Liam R. Howlett [Fri, 24 Jul 2020 16:01:47 +0000 (12:01 -0400)]
mm/mmap: Use maple tree for unmapped_area{_topdown}

The maple tree code was added to find the unmapped area in a previous
commit and was checked against what the rbtree returned, but the actual
result was never used.  Start using the maple tree implementation and
remove the rbtree code. Note, the advanced maple tree interface is used
so the rcu locking needs to be handled here or at a higher level.

Add kernel documentation comment for these functions.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
Liam R. Howlett [Fri, 24 Jul 2020 15:48:08 +0000 (11:48 -0400)]
mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree

Use the maple tree's advanced API and a maple state to walk the tree for
the entry at the address or the next vma, then use the maple state to
walk back one entry to find the previous entry.  Note, the advanced
maple tree interface does not handle the rcu locking.
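
A minimal sketch of the walk (locking is the caller's responsibility
with the advanced interface, as noted above):

  struct vm_area_struct *find_vma_prev(struct mm_struct *mm,
          unsigned long addr, struct vm_area_struct **pprev)
  {
          struct vm_area_struct *vma;
          MA_STATE(mas, &mm->mm_mt, addr, addr);

          vma = mas_find(&mas, ULONG_MAX); /* entry at addr or the next */
          *pprev = mas_prev(&mas, 0);      /* walk back one entry */

          return vma;
  }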

Add kernel documentation comments for this API.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Use the maple tree in find_vma() instead of the rbtree.
Liam R. Howlett [Fri, 24 Jul 2020 15:38:39 +0000 (11:38 -0400)]
mm/mmap: Use the maple tree in find_vma() instead of the rbtree.

Using the maple tree interface mt_find() will handle the RCU locking and
will start searching at the address up to the limit, ULONG_MAX in this
case.
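
A minimal sketch of the lookup; mt_find() takes a pointer to the index
so it can report where the entry was found:

  struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
  {
          unsigned long index = addr;

          /* RCU locking is handled inside mt_find(). */
          return mt_find(&mm->mm_mt, &index, ULONG_MAX);
  }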

Add kernel documentation to this API.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Start tracking VMAs with maple tree
Liam R. Howlett [Fri, 24 Jul 2020 17:04:30 +0000 (13:04 -0400)]
mm: Start tracking VMAs with maple tree

Start tracking the VMAs with the new maple tree structure in parallel
with the rb_tree.  Add debug and trace events for maple tree operations
and duplicate the rb_tree that is created on forks into the maple tree.

In this commit, the maple tree is added to the mm_struct including the
mm_init struct, added support in required mm/mmap functions, added
tracking in kernel/fork for process forking, and used to find the
unmapped_area and checked against what the rbtree finds.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoMaple Tree: Add new data structure
Liam R. Howlett [Fri, 24 Jul 2020 17:02:52 +0000 (13:02 -0400)]
Maple Tree: Add new data structure

The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently.  There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct.  The long term goal is to reduce
or remove the mmap_sem contention.

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses.  The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
3 years agoradix tree test suite: Add support for slab bulk APIs
Liam R. Howlett [Tue, 25 Aug 2020 19:51:54 +0000 (15:51 -0400)]
radix tree test suite: Add support for slab bulk APIs

Add support for kmem_cache_free_bulk() and kmem_cache_alloc_bulk() to
the radix tree test suite.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoradix tree test suite: Add kmem_cache enhancements and pr_err
Liam R. Howlett [Thu, 27 Feb 2020 20:13:20 +0000 (15:13 -0500)]
radix tree test suite: Add kmem_cache enhancements and pr_err

Add kmem_cache_set_non_kernel(), a mechanism to allow a certain number
of kmem_cache_alloc requests to succeed even when GFP_KERNEL is not set
in the flags.

Add kmem_cache_get_alloc() to see the size of the allocated kmem_cache.

Add a define of pr_err to printk.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
3 years agomm/mmap: Use find_vma_intersection() in do_mmap() for overlap
Liam R. Howlett [Tue, 24 Nov 2020 21:23:38 +0000 (16:23 -0500)]
mm/mmap: Use find_vma_intersection() in do_mmap() for overlap

Using find_vma_intersection() avoids the need for a temporary variable
and makes the code cleaner.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mmap: Introduce unlock_range() for code cleanup
Liam R. Howlett [Mon, 21 Sep 2020 19:37:14 +0000 (15:37 -0400)]
mm/mmap: Introduce unlock_range() for code cleanup

Both __do_munmap() and exit_mmap() unlock a range of VMAs using almost
identical code blocks.  Replace both blocks by a static inline function.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
3 years agomm/mempolicy: Use vma_lookup() in __access_remote_vm()
Liam R. Howlett [Mon, 3 May 2021 23:52:48 +0000 (19:52 -0400)]
mm/mempolicy: Use vma_lookup() in __access_remote_vm()

vma_lookup() finds the vma of a specific address with a cleaner
interface and is more readable.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/memory.c: Use vma_lookup() in __access_remote_vm()
Liam R. Howlett [Mon, 1 Mar 2021 19:30:38 +0000 (14:30 -0500)]
mm/memory.c: Use vma_lookup() in __access_remote_vm()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/mremap: Use vma_lookup() in vma_to_resize()
Liam R. Howlett [Thu, 8 Apr 2021 20:26:13 +0000 (16:26 -0400)]
mm/mremap: Use vma_lookup() in vma_to_resize()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/migrate: Use vma_lookup() in do_pages_stat_array()
Liam R. Howlett [Thu, 8 Apr 2021 20:24:41 +0000 (16:24 -0400)]
mm/migrate: Use vma_lookup() in do_pages_stat_array()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm/ksm: Use vma_lookup() in find_mergeable_vma()
Liam R. Howlett [Thu, 8 Apr 2021 20:23:43 +0000 (16:23 -0400)]
mm/ksm: Use vma_lookup() in find_mergeable_vma()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agolib/test_hmm: Use vma_lookup() in dmirror_migrate()
Liam R. Howlett [Thu, 8 Apr 2021 20:20:21 +0000 (16:20 -0400)]
lib/test_hmm: Use vma_lookup() in dmirror_migrate()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agokernel/events/uprobes: Use vma_lookup() in find_active_uprobe()
Liam R. Howlett [Thu, 8 Apr 2021 20:18:41 +0000 (16:18 -0400)]
kernel/events/uprobes: Use vma_lookup() in find_active_uprobe()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomisc/sgi-gru/grufault: Use vma_lookup() in gru_find_vma()
Liam R. Howlett [Thu, 8 Apr 2021 20:01:21 +0000 (16:01 -0400)]
misc/sgi-gru/grufault: Use vma_lookup() in gru_find_vma()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomedia: videobuf2: Use vma_lookup() in get_vaddr_frames()
Liam R. Howlett [Thu, 8 Apr 2021 17:35:15 +0000 (13:35 -0400)]
media: videobuf2: Use vma_lookup() in get_vaddr_frames()

vma_lookup() finds the vma of a specific address with a cleaner
interface and is more readable.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agodrm/amdgpu: Use vma_lookup() in amdgpu_ttm_tt_get_user_pages()
Liam R. Howlett [Thu, 8 Apr 2021 17:28:25 +0000 (13:28 -0400)]
drm/amdgpu: Use vma_lookup() in amdgpu_ttm_tt_get_user_pages()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agonet/ipv4/tcp: Use vma_lookup() in tcp_zerocopy_receive()
Liam R. Howlett [Thu, 8 Apr 2021 17:16:07 +0000 (13:16 -0400)]
net/ipv4/tcp: Use vma_lookup() in tcp_zerocopy_receive()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agovfio: Use vma_lookup() instead of find_vma_intersection()
Liam R. Howlett [Mon, 1 Mar 2021 19:27:28 +0000 (14:27 -0500)]
vfio: Use vma_lookup() instead of find_vma_intersection()

vma_lookup() finds the vma of a specific address with a cleaner
interface and is more readable.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agovirt/kvm: Use vma_lookup() instead of find_vma_intersection()
Liam R. Howlett [Mon, 1 Mar 2021 19:26:37 +0000 (14:26 -0500)]
virt/kvm: Use vma_lookup() instead of find_vma_intersection()

vma_lookup() finds the vma of a specific address with a cleaner
interface and is more readable.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agox86/sgx: Use vma_lookup() in sgx_encl_find()
Liam R. Howlett [Thu, 8 Apr 2021 17:26:35 +0000 (13:26 -0400)]
x86/sgx: Use vma_lookup() in sgx_encl_find()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/m68k/kernel/sys_m68k: Use vma_lookup() in sys_cacheflush()
Liam R. Howlett [Fri, 9 Apr 2021 00:54:37 +0000 (20:54 -0400)]
arch/m68k/kernel/sys_m68k: Use vma_lookup() in sys_cacheflush()

Using vma_lookup() enables simplified checking of the returned vma to
ensure the end address also falls within the same vma.  The start
address is guaranteed to be within the vma returned by vma_lookup().

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/mips/kernel/traps: Use vma_lookup() instead of find_vma()
Liam R. Howlett [Mon, 1 Mar 2021 19:29:09 +0000 (14:29 -0500)]
arch/mips/kernel/traps: Use vma_lookup() instead of find_vma()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/powerpc/kvm/book3s: Use vma_lookup() in kvmppc_hv_setup_htab_rma()
Liam R. Howlett [Thu, 8 Apr 2021 17:25:21 +0000 (13:25 -0400)]
arch/powerpc/kvm/book3s: Use vma_lookup() in kvmppc_hv_setup_htab_rma()

Using vma_lookup() removes the requirement to check if the address is
within the returned vma.  The code is easier to understand and more
compact.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/powerpc/kvm/book3s_hv_uvmem: Use vma_lookup() instead of find_vma_intersection()
Liam R. Howlett [Mon, 1 Mar 2021 19:29:29 +0000 (14:29 -0500)]
arch/powerpc/kvm/book3s_hv_uvmem: Use vma_lookup() instead of find_vma_intersection()

vma_lookup() finds the vma of a specific address with a cleaner
interface and is more readable.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/arm64/kvm: Use vma_lookup() instead of find_vma_intersection()
Liam R. Howlett [Mon, 1 Mar 2021 19:28:44 +0000 (14:28 -0500)]
arch/arm64/kvm: Use vma_lookup() instead of find_vma_intersection()

vma_lookup() finds the vma of a specific address with a cleaner
interface and is more readable.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoarch/arc/kernel/troubleshoot: use vma_lookup() instead of find_vma()
Liam R. Howlett [Mon, 1 Mar 2021 19:25:27 +0000 (14:25 -0500)]
arch/arc/kernel/troubleshoot: use vma_lookup() instead of find_vma()

Use vma_lookup() to find the VMA at a specific address.  As vma_lookup()
will return NULL if the address is not within any VMA, the start address
no longer needs to be validated.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agodrm/i915/selftests: Use vma_lookup() in __igt_mmap()
Liam R. Howlett [Thu, 8 Apr 2021 17:32:57 +0000 (13:32 -0400)]
drm/i915/selftests: Use vma_lookup() in __igt_mmap()

vma_lookup() will look up the vma at a specific address.  find_vma()
will start the search for a specific address and continue upwards.  This
fixes an issue with the selftest as the returned vma may not be the
newly created vma, but simply the vma at a higher address.

Fixes: 6fedafacae1b ("drm/i915/selftests: Wrap vm_mmap() around GEM objects")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agomm: Add vma_lookup()
Liam R. Howlett [Mon, 1 Mar 2021 19:01:06 +0000 (14:01 -0500)]
mm: Add vma_lookup()

Many places in the kernel use find_vma() to get a vma and then check the
start address of the vma to ensure the next vma was not returned.

Other places use the find_vma_intersection() call with addr, addr + 1 as
the range, looking for just the vma at a specific address.

The third use of find_vma() is by developers who do not know that the
function starts searching at the provided address upwards for the next
vma.  This results in a bug that is often overlooked for a long time.

Adding the new vma_lookup() function will allow for cleaner code by
removing the find_vma() calls which check limits, making
single-address find_vma_intersection() calls shorter, and potentially
reducing the incorrect uses of find_vma().
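
A minimal sketch of the new helper in terms of find_vma() (the actual
implementation may differ):

  static inline struct vm_area_struct *vma_lookup(struct mm_struct *mm,
          unsigned long addr)
  {
          struct vm_area_struct *vma = find_vma(mm, addr);

          /* find_vma() may return the next vma above addr; treat that
           * as a failed lookup. */
          if (vma && addr < vma->vm_start)
                  vma = NULL;

          return vma;
  }

Callers then replace the two-step pattern
"vma = find_vma(mm, addr); if (!vma || vma->vm_start > addr) ..." with
"vma = vma_lookup(mm, addr); if (!vma) ...", as the conversions above do.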

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
3 years agoAdd linux-next specific files for 20210506
Stephen Rothwell [Thu, 6 May 2021 02:30:55 +0000 (12:30 +1000)]
Add linux-next specific files for 20210506

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoMerge branch 'akpm/master'
Stephen Rothwell [Thu, 6 May 2021 01:47:44 +0000 (11:47 +1000)]
Merge branch 'akpm/master'

3 years agomemfd_secret: use unsigned int rather than long as syscall flags type
Mike Rapoport [Thu, 22 Apr 2021 06:43:28 +0000 (16:43 +1000)]
memfd_secret: use unsigned int rather than long as syscall flags type

Yury Norov says:

  If parameter size is the same for native and compat ABIs, we may
  wire a syscall made by compat client to native handler. This is
  true for unsigned int, but not true for unsigned long or pointer.

  That's why I suggest using unsigned int and so avoid creating compat
  entry point.

Use unsigned int as the type of the flags parameter in memfd_secret()
system call.
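
The change amounts to the type in the syscall definition; a sketch:

  /* Before: SYSCALL_DEFINE1(memfd_secret, unsigned long, flags) */
  SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)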

Link: https://lkml.kernel.org/r/20210331142345.27532-1-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agosecretmem: test: add basic selftest for memfd_secret(2)
Mike Rapoport [Thu, 22 Apr 2021 06:43:28 +0000 (16:43 +1000)]
secretmem: test: add basic selftest for memfd_secret(2)

The test verifies that file descriptor created with memfd_secret does not
allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK and that remote accesses with process_vm_read() and
ptrace() to the secret memory fail.

Link: https://lkml.kernel.org/r/20210303162209.8609-10-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoarch, mm: wire up memfd_secret system call where relevant
Mike Rapoport [Thu, 22 Apr 2021 06:43:28 +0000 (16:43 +1000)]
arch, mm: wire up memfd_secret system call where relevant

Wire up memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.

Link: https://lkml.kernel.org/r/20210303162209.8609-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoPM: hibernate: disable when there are active secretmem users
Mike Rapoport [Thu, 22 Apr 2021 06:43:27 +0000 (16:43 +1000)]
PM: hibernate: disable when there are active secretmem users

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this essentially
will defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.

Link: https://lkml.kernel.org/r/20210303162209.8609-8-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agosecretmem: optimize page_is_secretmem()
Mike Rapoport [Thu, 22 Apr 2021 06:43:27 +0000 (16:43 +1000)]
secretmem: optimize page_is_secretmem()

Kernel test robot reported -4.2% regression of
will-it-scale.per_thread_ops due to commit "mm: introduce memfd_secret
system call to create "secret" memory areas".

The perf profile of the test indicated that the regression is caused by
page_is_secretmem() called from gup_pte_range() (inlined by
gup_pgd_range):

 27.76  +2.5  30.23       perf-profile.children.cycles-pp.gup_pgd_range
  0.00  +3.2   3.19    2%  perf-profile.children.cycles-pp.page_mapping
  0.00  +3.7   3.66    2%  perf-profile.children.cycles-pp.page_is_secretmem

Further analysis showed that the slowdown happens because neither
page_is_secretmem() nor page_mapping() is inline and, moreover,
multiple page flags checks in page_mapping() involve calling
compound_head() several times for the same page.

Make page_is_secretmem() inline and replace page_mapping() with page flag
checks that do not imply page-to-head conversion.
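
A sketch of the inlined check; secretmem_aops is the assumed name of
the address_space operations of secretmem files, and the flag mask
follows the usual page->mapping encoding:

  static inline bool page_is_secretmem(struct page *page)
  {
          struct address_space *mapping;

          /* Secretmem pages are neither compound nor off the LRU, so
           * these cheap flag checks filter most pages without any
           * compound_head() calls. */
          if (PageCompound(page) || !PageLRU(page))
                  return false;

          mapping = (struct address_space *)
                  ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
          if (mapping != page->mapping)
                  return false;

          return mapping->a_ops == &secretmem_aops;
  }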

Link: https://lkml.kernel.org/r/20210420150049.14031-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agosecretmem/gup: don't check if page is secretmem without reference
Mike Rapoport [Thu, 22 Apr 2021 06:43:27 +0000 (16:43 +1000)]
secretmem/gup: don't check if page is secretmem without reference

The check in gup_pte_range() whether a page belongs to a secretmem mapping
is performed before grabbing the page reference.

To avoid a potential race, move the check after try_grab_compound_head().
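
A sketch of the reordered check in gup_pte_range(), with error paths
abbreviated:

  head = try_grab_compound_head(page, 1, flags);
  if (!head)
          goto pte_unmap;

  /* Only now, with a reference held, is it safe to test the mapping. */
  if (unlikely(page_is_secretmem(page))) {
          put_compound_head(head, 1, flags);
          goto pte_unmap;
  }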

Link: https://lkml.kernel.org/r/20210420150049.14031-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agomm: introduce memfd_secret system call to create "secret" memory areas
Mike Rapoport [Thu, 22 Apr 2021 06:43:27 +0000 (16:43 +1000)]
mm: introduce memfd_secret system call to create "secret" memory areas

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and not mapped not
only to other processes but in the kernel page tables as well.

The secretmem feature is off by default and the user must explicitly
enable it at boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call.  The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the processes
that have access to the file descriptor.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise().  The
file descriptor approach allows explicit and controlled sharing of the
memory areas and allows the operations to be sealed.  Besides, file
descriptor based memory paves the way for VMMs to remove the secret
memory range from the userspace hypervisor process, for instance QEMU.
Andy Lutomirski says:

  "Getting fd-backed memory into a guest will take some possibly major
  work in the kernel, but getting vma-backed memory into a guest without
  mapping it in the host user address space seems much, much worse."

memfd_secret() is made a dedicated system call rather than an extension to
memfd_create() because its purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to
the memory.  Nowadays a new system call cost is negligible while it is way
simpler for userspace to deal with a clear-cut system call than with a
multiplexer or an overloaded syscall.  Moreover, the initial
implementation of memfd_secret() is completely distinct from
memfd_create(), so there is not much sense in overloading memfd_create()
to begin with.  If there is ever a need for code sharing between these
implementations, it can be easily achieved without a need to adjust user
visible APIs.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there is a use case that requires exposing secretmem to the
kernel, it will be an opt-in request in the system call flags so that the
user would have to decide what data can be exposed to the kernel.

Removing pages from the direct map may cause fragmentation of the map on
architectures that use large pages to map the physical memory, which
affects the system performance.  However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice".  Hence, it is sufficient to
have secretmem disabled by default with the ability of a system
administrator to enable it at boot time.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

Since the secretmem mappings are locked in memory they cannot exceed
RLIMIT_MEMLOCK.  Since these mappings are already locked independently
from mlock(), an attempt to mlock()/munlock() secretmem range would fail
and mlockall()/munlockall() will ignore secretmem mappings.

However, unlike mlock()ed memory, secretmem currently behaves more like
long-term GUP: secretmem mappings are unmovable mappings directly consumed
by user space.  With default limits, there is no excessive use of
secretmem and it poses no real problem in combination with
ZONE_MOVABLE/CMA, but in the future this should be addressed to allow
balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

fd = memfd_secret(0);
ftruncate(fd, MAP_SIZE);
ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
           MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
Link: https://lkml.kernel.org/r/20210303162209.8609-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoset_memory: allow querying whether set_direct_map_*() is actually enabled
Mike Rapoport [Thu, 22 Apr 2021 06:43:26 +0000 (16:43 +1000)]
set_memory: allow querying whether set_direct_map_*() is actually enabled

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend set_memory API with can_set_direct_map() function that allows
checking if calling set_direct_map_*() will actually change the page
table, replace several occurrences of open coded checks in arm64 with the
new function and provide a generic stub for architectures that always
modify page tables upon calls to set_direct_map APIs.
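
A sketch of the generic stub and the arm64 variant; the arm64 condition
mirrors the kernel parameters mentioned above:

  /* Generic: architectures that always modify page tables. */
  static inline bool can_set_direct_map(void)
  {
          return true;
  }

  /* arm64: only effective with rodata=full or debug_pagealloc. */
  bool can_set_direct_map(void)
  {
          return rodata_full || debug_pagealloc_enabled();
  }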

[arnd@arndb.de: arm64: kfence: fix header inclusion ]
Link: https://lkml.kernel.org/r/20210303162209.8609-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoset_memory: allow set_direct_map_*_noflush() for multiple pages
Mike Rapoport [Thu, 22 Apr 2021 06:43:26 +0000 (16:43 +1000)]
set_memory: allow set_direct_map_*_noflush() for multiple pages

The underlying implementations of set_direct_map_invalid_noflush() and
set_direct_map_default_noflush() allow updating multiple contiguous pages
at once.

Add numpages parameter to set_direct_map_*_noflush() to expose this
ability with these APIs.
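
The resulting prototypes, with the new numpages parameter:

  int set_direct_map_invalid_noflush(struct page *page, int numpages);
  int set_direct_map_default_noflush(struct page *page, int numpages);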

Link: https://lkml.kernel.org/r/20210303162209.8609-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agoriscv/Kconfig: make direct map manipulation options depend on MMU
Mike Rapoport [Thu, 22 Apr 2021 06:43:26 +0000 (16:43 +1000)]
riscv/Kconfig: make direct map manipulation options depend on MMU

ARCH_HAS_SET_DIRECT_MAP and ARCH_HAS_SET_MEMORY configuration options have
no meaning when CONFIG_MMU is disabled and there is no point to enable
them for the nommu case.

Add an explicit dependency on MMU for these options.

Link: https://lkml.kernel.org/r/20210303162209.8609-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
3 years agommap: make mlock_future_check() global
Mike Rapoport [Thu, 22 Apr 2021 06:43:26 +0000 (16:43 +1000)]
mmap: make mlock_future_check() global

Patch series "mm: introduce memfd_secret system call to create "secret" memory areas", v18.

This is an implementation of "secret" mappings backed by a file
descriptor.

The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call.  The desired protection mode for the
memory is configured using the flags parameter of the system call.  The mmap()
of the file descriptor created with memfd_secret() will create a "secret"
memory mapping.  The pages in that mapping will be marked as not present
in the direct map and will be present only in the page table of the owning
mm.

Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.

Additionally, in the future the secret mappings may be used as a means to
protect guest memory in a virtual machine host.

For demonstration of secret memory usage we've created a userspace library

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git

that does two things: the first is to act as a preloader for openssl so
that all the OPENSSL_malloc calls are redirected to secret memory,
meaning any secret keys get automatically protected this way; the other
is to expose the API to the user who needs it.  We anticipate that a lot
of the use cases would be like the openssl one: many toolkits that deal
with secret keys already have special handling for the memory to try to
give them greater protection, so this would simply be pluggable into the
toolkits without any need for user application modification.

Hiding secret memory mappings behind an anonymous file allows usage of the
page cache for tracking pages allocated for the "secret" mappings as well
as using address_space_operations for e.g.  page migration callbacks.

The anonymous file may be also used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native"
mm ABIs in the future.

Removing pages from the direct map may cause fragmentation of the map on
architectures that use large pages to map the physical memory, which
affects the system performance.  However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice".  Hence, it is sufficient to
have secretmem disabled by default with the ability of a system
administrator to enable it at boot time.

In addition, there is also a long term goal to improve management of the
direct map.

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

This patch (of 8):

It will be used by the upcoming secret memory implementation.

Link: https://lkml.kernel.org/r/20210303162209.8609-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210303162209.8609-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>