www.infradead.org Git - users/jedix/linux-maple.git/log

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks: (55 commits)
  userfaultfd: fix SIGBUS resulting from false rwsem wakeups
  userfaultfd: hugetlbfs: fix add copy_huge_page_from_user for hugetlb userfaultfd support
  userfaultfd: hugetlbfs: reserve count on error in __mcopy_atomic_hugetlb
  userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY
  userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges
  userfaultfd: hugetlbfs: add userfaultfd_hugetlb test
  userfaultfd: hugetlbfs: allow registration of ranges containing huge pages
  userfaultfd: hugetlbfs: add userfaultfd hugetlb hook
  userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing
  userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY
  userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support
  userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support
  mm/hugetlb: fix huge page reservation leak in private mapping error paths
  mm/hugetlb: fix huge page reserve accounting for private mappings
  userfaultfd: don't pin the user memory in userfaultfd_file_create()
  userfaultfd: don't block on the last VM updates at exit time
  sparc: add waitfd to 32 bit system call tables
  userfaultfd: remove kernel header include from uapi header
  userfaultfd: register uapi generic syscall (aarch64)
  userfaultfd: selftest: don't error out if pthread_mutex_t isn't identical
  ...

Conflicts:
arch/x86/syscalls/syscall_32.tbl
arch/x86/syscalls/syscall_64.tbl
fs/Makefile
include/linux/mm_types.h
mm/hugetlb.c

userfaultfd: fix SIGBUS resulting from false rwsem wakeups

Orabug: 21685254

With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.

This seems caused by rwsem waking the thread blocked in handle_userfault()
and we can't run up_read() before the wait_event sequence is complete.

Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.

It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of course
there are signals or the uffd was released.

Debug code collecting the stack trace of the wakeup showed this:

$ ./userfaultfd 100 99999
nr_pages: 25600, nr_pages_per_cpu: 800
bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
bounces: 99997, mode: rnd ver poll, Bus error (core dumped)

   [<ffffffff8102e19b>] save_stack_trace+0x2b/0x50
   [<ffffffff8110b1d6>] try_to_wake_up+0x2a6/0x580
   [<ffffffff8110b502>] wake_up_q+0x32/0x70
   [<ffffffff8113d7a0>] rwsem_wake+0xe0/0x120
   [<ffffffff8148361b>] call_rwsem_wake+0x1b/0x30
   [<ffffffff81131d5b>] up_write+0x3b/0x40
   [<ffffffff812280fc>] vm_mmap_pgoff+0x9c/0xc0
   [<ffffffff81244b79>] SyS_mmap_pgoff+0x1a9/0x240
   [<ffffffff810228f2>] SyS_mmap+0x22/0x30
   [<ffffffff81842dfc>] entry_SYSCALL_64_fastpath+0x1f/0xbd
   [<ffffffffffffffff>] 0xffffffffffffffff
FAULT_FLAG_ALLOW_RETRY missing 70
CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G        W       4.8.0+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
0000000000000000 ffff880218027d40 ffffffff814749f6 ffff8802180c4a80
ffff880231ead4c0 ffff880218027e18 ffffffff812e0102 ffffffff81842b1c
ffff880218027dd0 ffff880218082960 0000000000000000 0000000000000000
Call Trace:
[<ffffffff814749f6>] dump_stack+0xb8/0x112
[<ffffffff812e0102>] handle_userfault+0x572/0x650
[<ffffffff81842b1c>] ? _raw_spin_unlock_irq+0x2c/0x50
[<ffffffff812407b4>] ? handle_mm_fault+0xcf4/0x1520
[<ffffffff81240d7e>] ? handle_mm_fault+0x12be/0x1520
[<ffffffff81240d8b>] handle_mm_fault+0x12cb/0x1520
[<ffffffff81483588>] ? call_rwsem_down_read_failed+0x18/0x30
[<ffffffff810572c5>] __do_page_fault+0x175/0x500
[<ffffffff810576f1>] trace_do_page_fault+0x61/0x270
[<ffffffff81050739>] do_async_page_fault+0x19/0x90
[<ffffffff81843ef5>] async_page_fault+0x25/0x30

This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).

This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault threads
when the last background threads are started with clone().

This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.

We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.  False wakeups could
still happen again the second time handle_userfault() is invoked, even if
it's a so rare race condition that getting false wakeups twice in a row is
impossible to reproduce.  This full fix is needed for correctness, the
only alternative would be to allow VM_FAULT_RETRY to be returned
infinitely.  With this fix the WP support can stick to a strict "one more"
VM_FAULT_RETRY logic (no need of returning it infinite times to avoid the
SIGBUS).

Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit d08f4a51bce6143390469c92e01d201ac4d68890)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: fix add copy_huge_page_from_user for hugetlb userfaultfd support

Orabug: 21685254

Was in Andrew's patch series on January 17, 2017 as:
userfaultfd hugetlbfs fix __mcopy_atomic_hugetlb retry error processing fix fix

kunmap() takes a page*, per Hugh

Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Ported to UEK ]
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: reserve count on error in __mcopy_atomic_hugetlb

Orabug: 21685254

If __mcopy_atomic_hugetlb exits with an error, put_page will be called if
a huge page was allocated and needs to be freed. If a reservation was
associated with the huge page, the PagePrivate flag will be set. Clear
PagePrivate before calling put_page/free_huge_page so that the global
reservation count is not incremented.

Link: http://lkml.kernel.org/r/20161216144821.5183-26-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit afcc2616d4284859d8f70bbdfa6c9ca92fbb08ed)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY

Orabug: 21685254

Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that
get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features will
work on hugetlbfs. This is required for fully functional userfaultfd
non-present support on hugetlbfs.

Link: http://lkml.kernel.org/r/20161216144821.5183-25-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit b33127bd2a0367d093bfeb1abd147754ff90e670)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
mm/hugetlb.c

userfaultfd: hugetlbfs: userfaultfd_huge_must_wait for hugepmd ranges

Orabug: 21685254

Add routine userfaultfd_huge_must_wait which has the same functionality as
the existing userfaultfd_must_wait routine. Only difference is that new
routine must handle page table structure for hugepmd vmas.

Link: http://lkml.kernel.org/r/20161216144821.5183-24-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit 36a121cb303d54b24bc4e590faf813daec1025d7)
[ Ported to UEK ]
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: add userfaultfd_hugetlb test

Orabug: 21685254

Test userfaultfd hugetlb functionality by using the existing testing
method (in userfaultfd.c).  Instead of an anonymous memeory, a hugetlbfs
file is mmap'ed private.  In this way fallocate hole punch can be used to
release pages.  This is because madvise(MADV_DONTNEED) is not supported
for huge pages.

Use the same file, but create wrappers for allocating ranges and releasing
pages.  Compile userfaultfd.c with HUGETLB_TEST defined to produce an
executable to test userfaultfd hugetlb functionality.

Link: http://lkml.kernel.org/r/20161216144821.5183-23-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit 44d5f4b70ff826f43791aa0fc18ca8fd02f1b432)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
tools/testing/selftests/vm/Makefile
tools/testing/selftests/vm/run_vmtests

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: allow registration of ranges containing huge pages

Orabug: 21685254

Expand the userfaultfd_register/unregister routines to allow VM_HUGETLB
vmas. huge page alignment checking is performed after a VM_HUGETLB vma is
encountered.

Also, since there is no UFFDIO_ZEROPAGE support for huge pages do not
return that as a valid ioctl method for huge page ranges.

Link: http://lkml.kernel.org/r/20161216144821.5183-22-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit 6be4576b101b7026f72ec240f393c1dd5dfa02da)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
fs/userfaultfd.c

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: add userfaultfd hugetlb hook

Orabug: 21685254

When processing a hugetlb fault for no page present, check the vma to
determine if faults are to be handled via userfaultfd. If so, drop the
hugetlb_fault_mutex and call handle_userfault().

Link: http://lkml.kernel.org/r/20161216144821.5183-21-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit 20609d5667d3db6545062527036875e68451086a)
[ Ported to UEK ]
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processing

Orabug: 21685254

The new routine copy_huge_page_from_user() uses kmap_atomic() to map
PAGE_SIZE pages.  However, this prevents page faults in the subsequent
call to copy_from_user().  This is OK in the case where the routine is
copied with mmap_sema held.  However, in another case we want to allow
page faults.  So, add a new argument allow_pagefault to indicate if the
routine should allow page faults.

Link: http://lkml.kernel.org/r/20161216144821.5183-20-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit 939d5ff6c4e48f72b3261baf8d4b82f54caf4561)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: add __mcopy_atomic_hugetlb for huge page UFFDIO_COPY

Orabug: 21685254

__mcopy_atomic_hugetlb performs the UFFDIO_COPY operation for huge pages.
It is based on the existing __mcopy_atomic routine for normal pages.
Unlike normal pages, there is no huge page support for the UFFDIO_ZEROPAGE
operation.

Link: http://lkml.kernel.org/r/20161216144821.5183-19-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit 46a4eb48229d92b7cc82f4f375bc713602486e4d)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support

Orabug: 21685254

hugetlb_mcopy_atomic_pte is the low level routine that implements the
userfaultfd UFFDIO_COPY command. It is based on the existing
mcopy_atomic_pte routine with modifications for huge pages.

Link: http://lkml.kernel.org/r/20161216144821.5183-18-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit 262653c3c59ca4294416b8fe43a381542f40fd67)
[ Ported to UEK ]
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support

Orabug: 21685254

userfaultfd UFFDIO_COPY allows user level code to copy data to a page at
fault time. The data is copied from user space to a newly allocated huge
page. The new routine copy_huge_page_from_user performs this copy.

Link: http://lkml.kernel.org/r/20161216144821.5183-17-aarcange@redhat.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from linux-next next-20170117
commit 33424a2bf04a630438a7bb8d53a2477ba7527164)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

mm/hugetlb: fix huge page reservation leak in private mapping error paths

Orabug: 21685254

Error paths in hugetlb_cow() and hugetlb_no_page() may free a newly
allocated huge page.

If a reservation was associated with the huge page, alloc_huge_page()
consumed the reservation while allocating.  When the newly allocated
page is freed in free_huge_page(), it will increment the global
reservation count.  However, the reservation entry in the reserve map
will remain.

This is not an issue for shared mappings as the entry in the reserve map
indicates a reservation exists.  But, an entry in a private mapping
reserve map indicates the reservation was consumed and no longer exists.
This results in an inconsistency between the reserve map and the global
reservation count.  This 'leaks' a reserved huge page.

Create a new routine restore_reserve_on_error() to restore the reserve
entry in these specific error paths.  This routine makes use of a new
function vma_add_reservation() which will add a reserve entry for a
specific address/page.

In general, these error paths were rarely (if ever) taken on most
architectures.  However, powerpc contained arch specific code that that
resulted in an extra fault and execution of these error paths on all
private mappings.

Fixes: 67961f9db8c4 ("mm/hugetlb: fix huge page reserve accounting for private mappings)
Link: http://lkml.kernel.org/r/1476933077-23091-2-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A . Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 96b96a96ddee4ba08ce4aeb8a558a3271fd4a7a7)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Conflicts:
mm/hugetlb.c

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

mm/hugetlb: fix huge page reserve accounting for private mappings

Orabug: 21685254 (userfaultfd hugetlb selftest depends on this fix)

When creating a private mapping of a hugetlbfs file, it is possible to
unmap pages via ftruncate or fallocate hole punch.  If subsequent faults
repopulate these mappings, the reserve counts will go negative.  This is
because the code currently assumes all faults to private mappings will
consume reserves.  The problem can be recreated as follows:

- mmap(MAP_PRIVATE) a file in hugetlbfs filesystem
- write fault in pages in the mapping
- fallocate(FALLOC_FL_PUNCH_HOLE) some pages in the mapping
- write fault in pages in the hole

This will result in negative huge page reserve counts and negative
subpool usage counts for the hugetlbfs.  Note that this can also be
recreated with ftruncate, but fallocate is more straight forward.

This patch modifies the routines vma_needs_reserves and vma_has_reserves
to examine the reserve map associated with private mappings similar to
that for shared mappings.  However, the reserve map semantics for
private and shared mappings are very different.  This results in subtly
different code that is explained in the comments.

Link: http://lkml.kernel.org/r/1464720957-15698-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 67961f9db8c477026ea20ce05761bde6f8bf85b0)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: don't pin the user memory in userfaultfd_file_create()

Orabug: 21685254

userfaultfd_file_create() increments mm->mm_users; this means that the
memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
after that can populate the orphaned mm more.

Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
mm->mm_count to pin mm_struct. This means that
atomic_inc_not_zero(mm->mm_users) is needed when we are going to
actually play with this memory. Except handle_userfault() path doesn't
need this, the caller must already have a reference.

The patch adds the new trivial helper, mmget_not_zero(), it can have
more users.

Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d2005e3f41d4f9299e2df6a967c8beb5086967a9)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: don't block on the last VM updates at exit time

Orabug: 21685254

The exit path will do some final updates to the VM of an exiting process
to inform others of the fact that the process is going away.

That happens, for example, for robust futex state cleanup, but also if
the parent has asked for a TID update when the process exits (we clear
the child tid field in user space).

However, at the time we do those final VM accesses, we've already
stopped accepting signals, so the usual "stop waiting for userfaults on
signal" code in fs/userfaultfd.c no longer works, and the process can
become an unkillable zombie waiting for something that will never
happen.

To solve this, just make handle_userfault() abort any user fault
handling if we're already in the exit path past the signal handling
state being dead (marked by PF_EXITING).

This VM special case is pretty ugly, and it is possible that we should
look at finalizing signals later (or move the VM final accesses
earlier). But in the meantime this is a fairly minimally intrusive fix.

Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 39680f50ae54cbbb6e72ac38b8329dd3eb9105f4)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

sparc: add waitfd to 32 bit system call tables

Orabug: 21685254

When the waitfd system call was added to UEK, it was only added to the
64 bit system call table. As a result, you can not add a new system
call to the end of the 32 and 64 bit system call tables as they will
not have the same system call number.

Add waitfd to the 32 bit system call tables, so that a new system call
can be added to all tables.

Fixes: 91352d1f (dtrace: add support for sparc64 1of3)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: remove kernel header include from uapi header

Orabug: 21685254

As include/uapi/linux/userfaultfd.h is a user visible header file, it
should not include kernel-exclusive header files.

So trying to build the userfaultfd test program from the selftests
directory fails, since it contains a reference to linux/compiler.h. As
it turns out, that header is not really needed there, so we can simply
remove it to fix that issue.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 9ff42d10c3b3e26d9555878f31b9a2e5c24efa57)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: register uapi generic syscall (aarch64)

Orabug: 21685254

Add the userfaultfd syscalls to uapi asm-generic, it was tested with
postcopy live migration on aarch64 with both 4k and 64k pagesize
kernels.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 09f7298100ea9767324298ab0c7979f6d7463183)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
include/uapi/asm-generic/unistd.h

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: selftest: don't error out if pthread_mutex_t isn't identical

Orabug: 21685254

On ppc big endian this check fails, the mutex doesn't necessarily need
to be identical for all pages after pthread_mutex_lock/unlock cycles.
The count verification (outside of the pthread_mutex_t structure)
suffices and that is retained.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5dd01be14565df814408327971775f36e55bf5e3)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: selftest: return an error if BOUNCE_VERIFY fails

Orabug: 21685254

This will report the error in the exit code, in addition of the fprintf.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a5932bf5737f0b5caf6deaa92b062e4fe66cf5b2)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: selftest: avoid my_bcmp false positives with powerpc

Orabug: 21685254

Keep a non-zero placeholder after the count, for the my_bcmp comparison
of the page against the zeropage. The lockless increment between 255 to
256 against a lockless my_bcmp could otherwise return false positives on
ppc32le.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1f5fee2cf232f9fac05b65f21107d2cf3c32092c)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: selftest: only warn if __NR_userfaultfd is undefined

Orabug: 21685254

If __NR_userfaultfd is not yet defined by the arch, warn but still build
and run the userfaultfd selftest successfully.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 56ed8f169e225dce1f9e40f6eee2e2dabe7d06fc)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: selftest: headers fixup

Orabug: 21685254

Depend on "make headers_install" to create proper headers to include and
provide syscall numbers.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 67f6a029b2ccf3399783a0ff2f812666f290d94f)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
tools/testing/selftests/vm/userfaultfd.c

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: selftests: vm: pick up sanitized kernel headers

Orabug: 21685254

Add the usr/include subdirectory of the top-level tree to the include
path, and make sure to include headers without relative paths to make
sure the sanitized headers get picked up. Otherwise the compiler will
not be able to find the linux/compiler.h header included by the non-
sanitized include/uapi/linux/userfaultfd.h.

While at it, make sure to only hardcode the syscall numbers on x86 and
PowerPC if they haven't been properly picked up from the headers.

Signed-off-by: Thierry Reding <treding@nvidia.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d0a871141d07929b559f5eae9c3fc4b63d16866b)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
tools/testing/selftests/vm/Makefile

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: add missing mmput() in error path

Orabug: 21685254

This fixes a memleak if anon_inode_getfile() fails in userfaultfd().

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c03e946fdd653c4a23e242aca83da7e9838f5b00)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

dax: revert userfaultfd change

Orabug: 21685254

Undo the change which "userfaultfd: call handle_userfault() for
userfaultfd_missing() faults" made to set_huge_zero_page(). DAX will
need that return value.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 7c414164593514f76b422faae0824bdd3754209b)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

selftests/userfaultfd: fix compiler warnings on 32-bit

Orabug: 21685254

On 32-bit:

    userfaultfd.c: In function 'locking_thread':
    userfaultfd.c:152: warning: left shift count >= width of type
    userfaultfd.c: In function 'uffd_poll_thread':
    userfaultfd.c:295: warning: cast to pointer from integer of different size
    userfaultfd.c: In function 'uffd_read_thread':
    userfaultfd.c:332: warning: cast to pointer from integer of different size

Fix the shift warning by splitting the shift in two parts, and the
integer/pointer warnigns by adding intermediate casts to "unsigned long".

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit af8713b701a74c3784ce6683f64f474a94b1b643)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: selftest: update userfaultfd x86 32bit syscall number

Orabug: 21685254

It changed as result of other syscalls, and while the system call list
itself was correctly updated, the selftest program was not.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 49df2e3e902e1c3caf998f97a92512424936199d)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: selftest

Orabug: 21685254

This test allocates two virtual areas and bounces the physical memory
across the two virtual areas using only userfaultfd.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c47174fc362a089b1125174258e53ef4a69ce6b8)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: avoid missing wakeups during refile in userfaultfd_read

Orabug: 21685254

During the refile in userfaultfd_read both waitqueues could look empty to
the lockless wake_userfault(). Use a seqcount to prevent this false
negative that could leave an userfault blocked.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 2c5b7e1be74ff0175dedbbd325abe9f0dbbb09ae)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: propagate the full address in THP faults

Orabug: 21685254

The THP faults were not propagating the original fault address. The
latest version of the API with uffd.arg.pagefault.address is supposed to
propagate the full address through THP faults.

This was not a kernel crashing bug and it wouldn't risk to corrupt user
memory, but it would cause a SIGBUS failure because the wrong page was
being copied.

For various reasons this wasn't easily reproducible in the qemu workload,
but the strestest exposed the problem immediately.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 230c92a8797e0e717c6732de0fffdd5726c0f48f)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: allow signals to interrupt a userfault

Orabug: 21685254

This is only simple to achieve if the userfault is going to return to
userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY
despite we temporarily released the mmap_sem. The fault would just be
retried by userland then. This is safe at least on x86 and powerpc (the
two archs with the syscall implemented so far).

Hint to verify for which archs this is safe: after handle_mm_fault
returns, no access to data structures protected by the mmap_sem must be
done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem)
is called.

This has two main benefits: signals can run with lower latency in
production (signals aren't blocked by userfaults and userfaults are
immediately repeated after signal processing) and gdb can then trivially
debug the threads blocked in this kind of userfaults coming directly from
userland.

On a side note: while gdb has a need to get signal processed, coredumps
always worked perfectly with userfaults, no matter if the userfault is
triggered by GUP a kernel copy_user or directly from userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit dfa37dc3fc1f6f81a6900d0e561c02362f4817f6)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: require UFFDIO_API before other ioctls

Orabug: 21685254

UFFDIO_API was already forced before read/poll could work. This makes the
code more strict to force it also for all other ioctls.

All users would already have been required to call UFFDIO_API before
invoking other ioctls but this makes it more explicit.

This will ensure we can change all ioctls (all but UFFDIO_API/struct
uffdio_api) with a bump of uffdio_api.api.

There's no actual plan or need to change the API or the ioctl, the current
API already should cover fine even the non cooperative usage, but this is
just for the longer term future just in case.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e6485a47b758cae04a496764a1095961ee3249e4)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE

Orabug: 21685254

These two ioctl allows to either atomically copy or to map zeropages
into the virtual address space. This is used by the thread that opened
the userfaultfd to resolve the userfaults.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ad465cae96b456b48d26c96f27a0577ba443472a)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: avoid mmap_sem read recursion in mcopy_atomic

Orabug: 21685254

If the rwsem starves writers it wasn't strictly a bug but lockdep
doesn't like it and this avoids depending on lowlevel implementation
details of the lock.

[akpm@linux-foundation.org: delete weird BUILD_BUG_ON()]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b6ebaedb4cb1a18220ae626c3a9e184ee39dd248)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation

Orabug: 21685254

This implements mcopy_atomic and mfill_zeropage that are the lowlevel
VM methods that are invoked respectively by the UFFDIO_COPY and
UFFDIO_ZEROPAGE userfaultfd commands.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c1a4de99fada21e2e9251e52cbb51eff5aadc757)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI

Orabug: 21685254

This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1f1c6f075904c241f9e44eb37efa8777141fc938)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: activate syscall

Orabug: 21685254

This activates the userfaultfd syscall.

[sfr@canb.auug.org.au: activate syscall fix]
[akpm@linux-foundation.org: don't enable userfaultfd on powerpc]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1380fca084743fef8d17e59b273473393944ce58)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
arch/x86/entry/syscalls/syscall_32.tbl
arch/x86/entry/syscalls/syscall_64.tbl

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: buildsystem activation

Orabug: 21685254

This allows to select the userfaultfd during configuration to build it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a14c151e567cb2c3e62611da808a8bdab86fdee5)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
fs/Makefile

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read

Orabug: 21685254

Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.

Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.

Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.

Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.

Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000
                                        postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED

                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                        /* check that no userfault raced with UFFDIO_COPY */
                        if (old_state == MISSING && tmp_state == REQUESTED)
                                UFFDIO_WAKE from background thread

And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000

                                        if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                UFFDIO_WAKE from userfault thread

This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8d2afd96c20316d112e04d935d9e09150e988397)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: allocate the userfaultfd_ctx cacheline aligned

Orabug: 21685254

Use proper slab to guarantee alignment.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3004ec9cabf49f43fae2b2bd1855a4720f1def7a)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: optimize read() and poll() to be O(1)

Orabug: 21685254

This makes read O(1) and poll that was already O(1) becomes lockless.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 15b726ef048b31a24b3fefb6863083a25fe34800)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: wake pending userfaults

Orabug: 21685254

This is an optimization but it's a userland visible one and it affects
the API.

The downside of this optimization is that if you call poll() and you
get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
may be waken by a different thread, before read(ufd) comes
around. This in short means that poll() isn't really usable if the
userfaultfd is opened in blocking mode.

userfaults won't wait in "pending" state to be read anymore and any
UFFDIO_WAKE or similar operations that has the objective of waking
userfaults after their resolution, will wake all blocked userfaults
for the resolved range, including those that haven't been read() by
userland yet.

The behavior of poll() becomes not standard, but this obviates the
need of "spurious" UFFDIO_WAKE and it lets the userland threads to
restart immediately without requiring an UFFDIO_WAKE. This is even
more significant in case of repeated faults on the same address from
multiple threads.

This optimization is justified by the measurement that the number of
spurious UFFDIO_WAKE accounts for 5% and 10% of the total
userfaults for heavy workloads, so it's worth optimizing those away.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ba85c702e4b247393ffe9e3fbc13d8aee7b02059)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: change the read API to return a uffd_msg

Orabug: 21685254

I had requests to return the full address (not the page aligned one) to
userland.

It's not entirely clear how the page offset could be relevant because
userfaults aren't like SIGBUS that can sigjump to a different place and it
actually skip resolving the fault depending on a page offset.  There's
currently no real way to skip the fault especially because after a
UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the
kernel without having to return to userland first (not even self modifying
code replacing the .text that touched the faulting address would prevent
the fault to be repeated).  Userland cannot skip repeating the fault even
more so if the fault was triggered by a KVM secondary page fault or any
get_user_pages or any copy-user inside some syscall which will return to
kernel code.  The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading
to a SIGBUS being raised because the userfault can't wait if it cannot
release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for
that).

Still returning userland a proper structure during the read() on the uffd,
can allow to use the current UFFD_API for the future non-cooperative
extensions too and it looks cleaner as well.  Once we get additional
fields there's no point to return the fault address page aligned anymore
to reuse the bits below PAGE_SHIFT.

The only downside is that the read() syscall will read 32bytes instead of
8bytes but that's not going to be measurable overhead.

The total number of new events that can be extended or of new future bits
for already shipped events, is limited to 64 by the features field of the
uffdio_api structure.  If more will be needed a bump of UFFD_API will be
required.

[akpm@linux-foundation.org: use __packed]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a9b85f9415fd9e529d03299e5335433f614ec1fb)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: Rename uffd_api.bits into .features

Orabug: 21685254

This is (seems to be) the minimal thing that is required to unblock
standard uffd usage from the non-cooperative one. Now more bits can be
added to the features field indicating e.g. UFFD_FEATURE_FORK and others
needed for the latter use-case.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3f602d2724b1f7d2d27ddcd7963a040a5890fd16)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: add new syscall to provide memory externalization

Orabug: 21685254

Once an userfaultfd has been created and certain region of the process
virtual address space have been registered into it, the thread responsible
for doing the memory externalization can manage the page faults in
userland by talking to the kernel using the userfaultfd protocol.

poll() can be used to know when there are new pending userfaults to be
read (POLLIN).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 86039bd3b4e6a1129318cbfed4e0a6e001656635)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx

Orabug: 21685254

vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware about so that we can merge vmas back like they were
originally before arming the userfaultfd on some memory range.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 19a809afe2fe089317226bbe5c5a1ce7f53dcdca)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: call handle_userfault() for userfaultfd_missing() faults

Orabug: 21685254

This is where the page faults must be modified to call
handle_userfault() if userfaultfd_missing() is true (so if the
vma->vm_flags had VM_UFFD_MISSING set).

handle_userfault() then takes care of blocking the page fault and
delivering it to userland.

The fault flags must also be passed as parameter so the "read|write"
kind of fault can be passed to userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6b251fc96cf2cdf1ce4b5db055547e2a5679bc77)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP

Orabug: 21685254

These two flags gets set in vma->vm_flags to tell the VM common code
if the userfaultfd is armed and in which mode (only tracking missing
faults, only tracking wrprotect faults or both). If neither flags is
set it means the userfaultfd is not armed on the vma.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 16ba6f811dfe44bc14f7946a4b257b85476fc16e)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct

Orabug: 21685254

This adds the vm_userfaultfd_ctx to the vm_area_struct.

Oracle specific changes to maintain KABI also made.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 745f234be12b6191b15eae8dd415cc81a9137f47)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
include/linux/mm_types.h

userfaultfd: linux/userfaultfd_k.h

Orabug: 21685254

Kernel header defining the methods needed by the VM common code to
interact with the userfaultfd.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 932b18e0aec65acb089f4bd8761ee85e70f8eb6a)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: uAPI

Orabug: 21685254

Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1038628d80e96e3a086189172d9be8eb85ecfabf)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

userfaultfd: linux/Documentation/vm/userfaultfd.txt

Orabug: 21685254

This is the latest userfaultfd patchset.  The postcopy live migration
feature on the qemu side is mostly ready to be merged and it entirely
depends on the userfaultfd syscall to be merged as well.  So it'd be great
if this patchset could be reviewed for merging in -mm.

Userfaults allow to implement on demand paging from userland and more
generally they allow userland to more efficiently take control of the
behavior of page faults than what was available before (PROT_NONE +
SIGSEGV trap).

The use cases are:

1) KVM postcopy live migration (one form of cloud memory
   externalization).

   KVM postcopy live migration is the primary driver of this work:

    http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
    http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) postcopy live migration of binaries inside linux containers:

    http://thread.gmane.org/gmane.linux.kernel.mm/132662

3) KVM postcopy live snapshotting (allowing to limit/throttle the
   memory usage, unlike fork would, plus the avoidance of fork
   overhead in the first place).

   While the wrprotect tracking is not implemented yet, the syscall API is
   already contemplating the wrprotect fault tracking and it's generic enough
   to allow its later implementation in a backwards compatible fashion.

4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
   should be extended to work also on tmpfs and then the
   uffdio_register.ioctls will notify userland that UFFDIO_COPY is
   available even when the registered virtual memory range is tmpfs
   backed.

5) alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access with the
   CPU the virtual regions marked volatile. This depends on point 4)
   to be fulfilled first, as volatile pages happily apply to tmpfs.

Even though there wasn't a real use case requesting it yet, it also
allows to implement distributed shared memory in a way that readonly
shared mappings can exist simultaneously in different hosts and they
can be become exclusive at the first wrprotect fault.

This patch (of 22):

Add documentation.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 25edd8bffd0f7563f0c04c1d219eb89061ce9886)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

mm/hugetlbfs: unmap pages if page fault raced with hole punch

Orabug: 21685254

Page faults can race with fallocate hole punch.  If a page fault happens
between the unmap and remove operations, the page is not removed and
remains within the hole.  This is not the desired behavior.  The race is
difficult to detect in user level code as even in the non-race case, a
page within the hole could be faulted back in before fallocate returns.
If userfaultfd is expanded to support hugetlbfs in the future, this race
will be easier to observe.

If this race is detected and a page is mapped, the remove operation
(remove_inode_hugepages) will unmap the page before removing.  The unmap
within remove_inode_hugepages occurs with the hugetlb_fault_mutex held
so that no other faults will be processed until the page is removed.

The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
remove_inode_hugepages to satisfy the new reference.

[akpm@linux-foundation.org: move hugetlb_vmdelete_list()]
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4aae8d1c051ea00b456da6811bc36d1f69de5445)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

Merge branch topic/uek-4.1/rpm-build of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/rpm-build:
  perf: build TUI by default by pulling in slang and linking it statically
  configs: ol6: set with_headers, with_dtrace defaults to 0
  smartpqi: enable driver in uek config files
  Add the CONFIG_DEBUG_SET_MODULE_RONX option to OL6

Merge branch topic/uek-4.1/pmem of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/pmem: (222 commits)
  libnvdimm, dax: record the specified alignment of a dax-device instance
  libnvdimm, dax: reserve space to store labels for device-dax
  libnvdimm, dax: introduce device-dax infrastructure
  libnvdimm: cleanup nvdimm_namespace_common_probe(), kill 'host'
  mm, dax: fix livelock, allow dax pmd mappings to become writeable
  dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
  tools/testing/libnvdimm: cleanup mock resource lookup
  block: protect rw_page against device teardown
  fix kABI breakage caused by "block: generic request_queue reference counting"
  block: generic request_queue reference counting
  fix kABI breakage caused by "block: use an atomic_t for mq_freeze_depth"
  block: use an atomic_t for mq_freeze_depth
  dax: guarantee page aligned results from bdev_direct_access()
  dax: increase granularity of dax_clear_blocks() operations
  pmem, dax: clean up clear_pmem()
  xfs: fix recursive splice read locking with DAX
  xfs: per-filesystem stats counter implementation
  xfs: per-filesystem stats in sysfs
  xfs: pass xfsstats structures to handlers and macros
  xfs: consolidate sysfs ops
  ...

Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/ofed:
  RDS: don't commit to queue till transport connection is up
  RDS: restrict socket connection reset to CAP_NET_ADMIN
  xsigo: Fix crash in accessing xve proc l2 entries
  xsigo: Fix race in freeing aged Forwarding table entry
  xsigo: Schedule while uninterruptible

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/drivers:
  NVMe: reverse IO direction for VUC command code F7
  Call i40e_client_get_params only after the instance is checked
  scsi: smartpqi: raid bypass lba calculation fix
  scsi: smartpqi: bump driver version
  scsi: smartpqi: add smartpqi.txt
  scsi: smartpqi: update Kconfig
  scsi: smartpqi: remove timeout for cache flush operations
  scsi: smartpqi: scsi queuecommand cleanup
  scsi: smartpqi: minor tweaks to update time support
  scsi: smartpqi: minor function reformating
  scsi: smartpqi: correct event acknowledgment timeout issue
  scsi: smartpqi: correct controller offline issue
  scsi: smartpqi: add kdump support
  scsi: smartpqi: enhance reset logic
  scsi: smartpqi: enhance drive offline informational message
  scsi: smartpqi: simplify spanning
  scsi: smartpqi: change tmf macro names
  scsi: smartpqi: change aio sg processing
  aacraid: remove wildcard for series 9 controllers
  smartpqi: initial commit of Microsemi smartpqi driver

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks:
  xfs: validate metadata LSNs against log on v5 superblocks
  IB/ipoib: move back IB LL address into the hard header
  net: preserve IP control block during GSO segmentation
  xfs: fix broken multi-fsb buffer logging
  xfs: Split default quota limits by quota type

Merge branch topic/uek-4.1/stable-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/stable-cherry-picks:
  crypto: skcipher - Fix blkcipher walk OOM crash
  crypto: cryptd - initialize child shash_desc on import
  crypto: scatterwalk - Fix test in scatterwalk_done
  crypto: gcm - Filter out async ghash if necessary
  PKCS#7: pkcs7_validate_trust(): initialize the _trusted output argument
  crypto: public_key: select CRYPTO_AKCIPHER
  crypto: hash - Fix page length clamping in hash walk

perf: build TUI by default by pulling in slang and linking it statically

Orabug: 25161079

xfs: validate metadata LSNs against log on v5 superblocks

From a45086e27dfa21a4b39134f7505c8f60a3ecdec4 Mon Sep 17 00:00:00 2001

Since the onset of v5 superblocks, the LSN of the last modification has
been included in a variety of on-disk data structures. This LSN is used
to provide log recovery ordering guarantees (e.g., to ensure an older
log recovery item is not replayed over a newer target data structure).

While this works correctly from the point a filesystem is formatted and
mounted, userspace tools have some problematic behaviors that defeat
this mechanism. For example, xfs_repair historically zeroes out the log
unconditionally (regardless of whether corruption is detected). If this
occurs, the LSN of the filesystem is reset and the log is now in a
problematic state with respect to on-disk metadata structures that might
have a larger LSN. Until either the log catches up to the highest
previously used metadata LSN or each affected data structure is modified
and written out without incident (which resets the metadata LSN), log
recovery is susceptible to filesystem corruption.

This problem is ultimately addressed and repaired in the associated
userspace tools. The kernel is still responsible to detect the problem
and notify the user that something is wrong. Check the superblock LSN at
mount time and fail the mount if it is invalid. From that point on,
trigger verifier failure on any metadata I/O where an invalid LSN is
detected. This results in a filesystem shutdown and guarantees that we
do not log metadata changes with invalid LSNs on disk. Since this is a
known issue with a known recovery path, present a warning to instruct
the user how to recover.

Orabug: 25062171

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

IB/ipoib: move back IB LL address into the hard header

Orabug: 24469379

[ Backport of upstream commit fc791b633515 ]

After the commit 9207f9d45b0a ("net: preserve IP control block
during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
That destroy the IPoIB address information cached there,
causing a severe performance regression, as better described here:

http://marc.info/?l=linux-kernel&m=146787279825501&w=2

This change moves the data cached by the IPoIB driver from the
skb control lock into the IPoIB hard header, as done before
the commit 936d7de3d736 ("IPoIB: Stop lying about hard_header_len
and use skb->cb to stash LL addresses").
In order to avoid GRO issue, on packet reception, the IPoIB driver
stash into the skb a dummy pseudo header, so that the received
packets have actually a hard header matching the declared length.
To avoid changing the connected mode maximum mtu, the allocated
head buffer size is increased by the pseudo header length.

After this commit, IPoIB performances are back to pre-regression
value.

v2 -> v3: rebased
v1 -> v2: avoid changing the max mtu, increasing the head buf size

Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

net: preserve IP control block during GSO segmentation

Orabug: 24469379

[ Upstream commit 9207f9d45b0ad071baa128e846d7e7ed85016df3 ]

Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.

This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit abefd1b4087b9b5e83e7b4e7689f8b8e3cb2899c)

Signed-off-by: Dan Duval <dan.duval@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

RDS: don't commit to queue till transport connection is up

A rouge application can flood the send queue by targeting a dead
or non-existing node. Don't commit any messages to the queue
till the legitimate connection to the peer is established.
Let application retry so that only legit connections can
get on to the send queue.

Orabug: 25393611
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

RDS: restrict socket connection reset to CAP_NET_ADMIN

Normal users not suppose to need/have access to the transport
connection reset.

Orabug:25393611
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

configs: ol6: set with_headers, with_dtrace defaults to 0

This makes OL6 config match OL7 config for these default %defines. Later
directives conditionally set with_headers if sparc64, and with_dtrace if
sparc64|x86_64, overriding the defaults.

Orabug: 25257401
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>

NVMe: reverse IO direction for VUC command code F7

Orabug: 25258071

Samsung uses D2H command with Vendor Uniq Command (VUC) code F7
(the 0th bit of which is 1) for retrieving memory dump. In UEK4,
Bit 0 of the D2H command code has to be 0. Because of this voilation,
the nvmecli is unable to do crash and memory dumps in UEK4.

As the Samsung firmware can only understand VUC command code F7,
the IO direction is reversed for this vendor command code to
retrieve memory and crash dump.

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>

xfs: fix broken multi-fsb buffer logging

Orabug: 24400444
Upstream-commit: a3916e528b917851a4d2379e2fd2579ad5f2b5a7

Multi-block buffers are logged based on buffer offset in
xfs_trans_log_buf(). xfs_buf_item_log() ultimately walks each mapping in
the buffer and marks the associated range to be logged in the
xfs_buf_log_format bitmap for that mapping. This code is broken,
however, in that it marks the actual buffer offsets of the associated
range in each bitmap rather than shifting to the byte range for that
particular mapping.

For example, on a 4k fsb fs, buffer offset 4096 refers to the first byte
of the second mapping in the buffer. This means byte 0 of the second log
format bitmap should be tagged as dirty. Instead, the current code marks
byte offset 4096 of the second log format bitmap, which is invalid and
potentially out of range of the mapping.

As a result of this, the log item format code invoked at transaction
commit time is not be able to correctly identify what parts of the
buffer to copy into log vectors. This can lead to NULL log vector
pointer dereferences in CIL push context if the item format code was not
able to locate any dirty ranges at all. This crash has been reproduced
on a 4k FSB filesystem using 16k directory blocks where an unlink
operation happened not to log anything in the first block of the
mapping. The logged offsets were all over 4k, marked as such in the
subsequent log format mappings, and thus left the transaction with an
xfs_log_item that is marked DIRTY but without any logged regions.

Further, even when the logged regions are marked correctly in the buffer
log format bitmaps, the format code doesn't copy the correct ranges of
the buffer into the log. This means that any logged region beyond the
first block of a multi-block buffer is subject to corruption after a
crash and log recovery sequence. This is due to a failure to convert the
mapping bm_len field from basic blocks to bytes in the buffer offset
tracking code in xfs_buf_item_format().

Update xfs_buf_item_log() to convert buffer offsets to segment relative
offsets when logging multi-block buffers. This ensures that the modified
regions of a buffer are logged correctly and avoids the aforementioned
crash. Also update xfs_buf_item_format() to correctly track the source
offset into the buffer for the log vector formatting code. This ensures
that the correct data is copied into the log.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xfs: Split default quota limits by quota type

Orabug: 24399524
Upstream-commit: be6079461abf796e29d02b450a16908f4bf58f6c

Default quotas are globally set due historical reasons. IRIX only
supported user and project quotas, and default quota was only
applied to user quotas.

In Linux, when a default quota is set, all different quota types
inherits the same default value.

An user with a quota limit larger than the default quota value, will
still be limited to the default value because the group quotas also
inherits the default quotas. Unless the group which the user belongs
to have a custom quota limit set.

This patch aims to split the default quota value by quota type.
Allowing each quota type having different default values.

Default time limits are still set globally. XFS does not set a
per-user/group timer, but a single global timer. For changing this
behavior, some changes should be made in user-space tools another
bugs being fixed.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

xsigo: Fix crash in accessing xve proc l2 entries

Orabug: 25165085

When accessing l2tables using /proc/driver/xve/devices/
system panics if path is created and there is no associated
Transmit QP. Added a check for validating tx structure before
using it.

Reported-by: Jie zhu <jie.x.zhu@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: sajid zia <szia@oracle.com>

Call i40e_client_get_params only after the instance is checked

We can avoid the minor bit of work by calling check params after we
check for the client instance, since we're about to return early in
cases where we do not have a client. But, more importantly, this
change prevents a panic from a NULL pointer when attempting to
get the client params.

Orabug: 25159384

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>

smartpqi: enable driver in uek config files

Orabug: 25144431

Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: raid bypass lba calculation fix

Orabug: 25144431

In the ioaccel path, the calculation of the starting LBA for
READ(6)/WRITE(6) SCSI commands does not take into account the most
significant 5 bits of the LBA: it only uses the least significant 16
bits of the starting LBA.

Reported-by: Mahesh Rajashekhara <mahesh.rajashekhara@microsemi.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e018ef572ba4ff17caa9e82d5e1b5cea0d76f903)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: bump driver version

Orabug: 25144431

[mkp: fixed typo]

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 699bed758b1313f97a5ac78848090e8357d12ab1)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: add smartpqi.txt

Orabug: 25144431

added Documentation/scsi/smartpqi.txt

[mkp: applied by hand]

Reviewed-by: Kevin Barnett <kevin.barnett@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 425b490b2aa745740ea3618e1cdcc2bc37c0d996)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: update Kconfig

Orabug: 25144431

The aacraid driver will not managage Microsemi smartpqi controllers, but
will still manage older aacraid devices.

Updated help section.

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Kevin Barnett <kevin.barnett@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e8a31ebae1669f05254430d2fced99d77c63fc10)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: remove timeout for cache flush operations

Orabug: 25144431

Some cache flush operations can take longer than the timeout value. Best
to not impose a time limit to handle all cases.

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d48f8fad1e435eff26c29e8e109c1a50c441e533)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: scsi queuecommand cleanup

Orabug: 25144431

minor cleanup of scsi queue command function

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 7d81d2b8714ec72462a99875acbf2f976402f3f1)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: minor tweaks to update time support

Orabug: 25144431

minor tweaks to update time support

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 4fbebf1a779d9f6890ddc1df90c497b161dfb34c)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: minor function reformating

Orabug: 25144431

reformatted pqi_num_elements_free() to match the rest of the driver

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit df7a1fcfc4761e658b60739e2ff4cd148afcae89)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: correct event acknowledgment timeout issue

Orabug: 25144431

the driver no longer waits for the firmware to consume
the event ack IU.

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 5e6429df9c8b3ab9e0a8d18af5248692ebe41871)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: correct controller offline issue

Orabug: 25144431

Fixes: 6c223761e 'smartpqi: initial commit of Microsemi smartpqi driver'
Fixed a bug where the driver would not free all of the
controller resources if the controller ever went offline.

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e57a1f9b2fa4326ec289f1d03c658184ed6addb8)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: add kdump support

Orabug: 25144431

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit ff6abb7383d2eec6c8c988ff661352e66f245686)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: enhance reset logic

Orabug: 25144431

Eliminated timeout from LUN reset logic.

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 14bb215d09de98a8e95fa2bb1b8f35b79672c5df)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: enhance drive offline informational message

Orabug: 25144431

Made a couple of error messages more verbose.

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e58081a714275d1490e470bdaf1b5dc23043cf2a)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: simplify spanning

Orabug: 25144431

Removed the workaround for the transition to spanning.

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 77668f412dbcd6a9dd04c92f0b170c5b5182a5fb)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: change tmf macro names

Orabug: 25144431

small change to make code look cleaner

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b17f048658c9b1bc8ac1d9a54b223f740c70f8fd)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: change aio sg processing

Orabug: 25144431

Take advantage of controller improvements.

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit a60eec0251fe1bfc0cd549c073591e6657761158)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

aacraid: remove wildcard for series 9 controllers

Orabug: 25144431

Depends on smartpqi driver adoption

Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>

smartpqi: initial commit of Microsemi smartpqi driver

Orabug: 25144431

This initial commit contains Microsemi's smartpqi module.

[mkp: Minor tweaks to apply to 4.9/scsi-queue]

Reviewed-by: Scott Benesh <scott.benesh@microsemi.com>
Reviewed-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 6c223761eb5482dca2bd981d0a800c4aba3c9009)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xsigo: Fix race in freeing aged Forwarding table entry

Orabug: 25129729

Fixed a race in case of freeing an Aged Forwarding entry
When there is no path associated with a Forwarding entry and
if it ages out uVNIC driver starts aging it ,if a new path with
different Forwarding entry (different mac but same remote guid
and qpn) creates a path present code ends up deleting
that newly created path.

Added a variable del_in_progress to detect and avoid race if
forwarding table is accessed during deletion.

Added protection in one case, while accessing forwarding table entry.
Cleaned up unwanted messages

Reported-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: sajid zia <szia@oracle.com>

xsigo: Schedule while uninterruptible

Orabug: 25097469

Fix the case where uVNIC driver calls msleep while
holding spinlock in case path creation failure.

Reported-by: Haakon bugge <haakon.bugge@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: sajid zia <szia@oracle.com>

Add the CONFIG_DEBUG_SET_MODULE_RONX option to OL6

Set the CONFIG_DEBUG_SET_MODULE_RONX option in the OL6 configuration,
this changes the page permissions of a loadable module, code and RO
data are set to RX, leaving off the write permission for the page,
this causes modify access to be trapped and provides enhanced
security.

Orabug: 24910950
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>

Merge branch topic/uek-4.1/rpm-build of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/rpm-build:
Enable config options for IEEE 802.1AE driver

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks: (113 commits)
  packet: fix race condition in packet_set_ring
  net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
  ALSA: pcm : Call kill_fasync() in stream lock
  netlink: Fix dump skb leak/double free
  rcu: Fix soft lockup for rcu_nocb_kthread
  mpi: Fix NULL ptr dereference in mpi_powm() [ver #3]
  sctp: validate chunk len before actually using it
  kvm: raise KVM_SOFT_MAX_VCPUS to support more vcpus
  netfilter: nfnetlink: fix splat due to incorrect socket memory accounting in skbuff clones
  netfilter: nfnetlink: avoid recurrent netns lookups in call_batch
  netfilter: nf_tables: fix wrong destroy anonymous sets if binding fails
  netfilter: nf_tables: use reverse traversal commit_list in nf_tables_abort
  ixgbevf: Handle previously-freed msix_entries
  PCI: pciehp: Prioritize data-link event over presence detect
  PCI: pciehp: Leave power indicator on when enabling already-enabled slot
  net: Fix use after free in the recvmmsg exit path
  signals: avoid unnecessary taking of sighand->siglock
  audit: fix a double fetch in audit_log_single_execve_arg()
  KEYS: Fix short sprintf buffer in /proc/keys show function
  tools/power turbostat: Replace MSR_NHM_TURBO_RATIO_LIMIT
  ...

packet: fix race condition in packet_set_ring

When packet_set_ring creates a ring buffer it will initialize a
struct timer_list if the packet version is TPACKET_V3. This value
can then be raced by a different thread calling setsockopt to
set the version to TPACKET_V1 before packet_set_ring has finished.

This leads to a use-after-free on a function pointer in the
struct timer_list when the socket is closed as the previously
initialized timer will not be deleted.

The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
changing the packet version while also taking the lock at the start
of packet_set_ring.

Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 84ac7260236a49c79eede91617700174c2c19b0c)

Orabug: 25209594
CVE: CVE-2016-8655
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>