]> www.infradead.org Git - users/jedix/linux-maple.git/log
users/jedix/linux-maple.git
8 years agouserfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation
Andrea Arcangeli [Fri, 4 Sep 2015 22:47:04 +0000 (15:47 -0700)]
userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation

Orabug: 21685254

This implements mcopy_atomic and mfill_zeropage that are the lowlevel
VM methods that are invoked respectively by the UFFDIO_COPY and
UFFDIO_ZEROPAGE userfaultfd commands.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c1a4de99fada21e2e9251e52cbb51eff5aadc757)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI
Andrea Arcangeli [Fri, 4 Sep 2015 22:47:01 +0000 (15:47 -0700)]
userfaultfd: UFFDIO_COPY|UFFDIO_ZEROPAGE uAPI

Orabug: 21685254

This implements the uABI of UFFDIO_COPY and UFFDIO_ZEROPAGE.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1f1c6f075904c241f9e44eb37efa8777141fc938)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: activate syscall
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:58 +0000 (15:46 -0700)]
userfaultfd: activate syscall

Orabug: 21685254

This activates the userfaultfd syscall.

[sfr@canb.auug.org.au: activate syscall fix]
[akpm@linux-foundation.org: don't enable userfaultfd on powerpc]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1380fca084743fef8d17e59b273473393944ce58)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
arch/x86/entry/syscalls/syscall_32.tbl
arch/x86/entry/syscalls/syscall_64.tbl

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: buildsystem activation
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:54 +0000 (15:46 -0700)]
userfaultfd: buildsystem activation

Orabug: 21685254

This allows to select the userfaultfd during configuration to build it.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a14c151e567cb2c3e62611da808a8bdab86fdee5)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
fs/Makefile

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:51 +0000 (15:46 -0700)]
userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read

Orabug: 21685254

Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
userfaultfd_read if they are run on different threads simultaneously.

Until now qemu solved the race in userland: the race was explicitly
and intentionally left for userland to solve. However we can also
solve it in kernel.

Requiring all users to solve this race if they use two threads (one
for the background transfer and one for the userfault reads) isn't
very attractive from an API prospective, furthermore this allows to
remove a whole bunch of mutex and bitmap code from qemu, making it
faster. The cost of __get_user_pages_fast should be insignificant
considering it scales perfectly and the pagetables are already hot in
the CPU cache, compared to the overhead in userland to maintain those
structures.

Applying this patch is backwards compatible with respect to the
userfaultfd userland API, however reverting this change wouldn't be
backwards compatible anymore.

Without this patch qemu in the background transfer thread, has to read
the old state, and do UFFDIO_WAKE if old_state is missing but it
become REQUESTED by the time it tries to set it to RECEIVED (signaling
the other side received an userfault).

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000
                                        postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED

                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                        /* check that no userfault raced with UFFDIO_COPY */
                        if (old_state == MISSING && tmp_state == REQUESTED)
                                UFFDIO_WAKE from background thread

And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:

    vcpu                background_thr userfault_thr
    -----               -----          -----
    vcpu0 handle_mm_fault()

                        postcopy_place_page
                        read old_state -> MISSING
                        UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                        tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED

    vcpu0 fault at 0x7fb76a139000 enters handle_userfault
    poll() is kicked

                                        poll() -> POLLIN
                                        read() -> 0x7fb76a139000

                                        if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                UFFDIO_WAKE from userfault thread

This patch removes the need of both UFFDIO_WAKE and of the associated
per-page tristate as well.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8d2afd96c20316d112e04d935d9e09150e988397)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: allocate the userfaultfd_ctx cacheline aligned
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:48 +0000 (15:46 -0700)]
userfaultfd: allocate the userfaultfd_ctx cacheline aligned

Orabug: 21685254

Use proper slab to guarantee alignment.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3004ec9cabf49f43fae2b2bd1855a4720f1def7a)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: optimize read() and poll() to be O(1)
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:44 +0000 (15:46 -0700)]
userfaultfd: optimize read() and poll() to be O(1)

Orabug: 21685254

This makes read O(1) and poll that was already O(1) becomes lockless.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 15b726ef048b31a24b3fefb6863083a25fe34800)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: wake pending userfaults
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:41 +0000 (15:46 -0700)]
userfaultfd: wake pending userfaults

Orabug: 21685254

This is an optimization but it's a userland visible one and it affects
the API.

The downside of this optimization is that if you call poll() and you
get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
may be waken by a different thread, before read(ufd) comes
around. This in short means that poll() isn't really usable if the
userfaultfd is opened in blocking mode.

userfaults won't wait in "pending" state to be read anymore and any
UFFDIO_WAKE or similar operations that has the objective of waking
userfaults after their resolution, will wake all blocked userfaults
for the resolved range, including those that haven't been read() by
userland yet.

The behavior of poll() becomes not standard, but this obviates the
need of "spurious" UFFDIO_WAKE and it lets the userland threads to
restart immediately without requiring an UFFDIO_WAKE. This is even
more significant in case of repeated faults on the same address from
multiple threads.

This optimization is justified by the measurement that the number of
spurious UFFDIO_WAKE accounts for 5% and 10% of the total
userfaults for heavy workloads, so it's worth optimizing those away.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ba85c702e4b247393ffe9e3fbc13d8aee7b02059)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: change the read API to return a uffd_msg
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:37 +0000 (15:46 -0700)]
userfaultfd: change the read API to return a uffd_msg

Orabug: 21685254

I had requests to return the full address (not the page aligned one) to
userland.

It's not entirely clear how the page offset could be relevant because
userfaults aren't like SIGBUS that can sigjump to a different place and it
actually skip resolving the fault depending on a page offset.  There's
currently no real way to skip the fault especially because after a
UFFDIO_COPY|ZEROPAGE, the fault is optimized to be retried within the
kernel without having to return to userland first (not even self modifying
code replacing the .text that touched the faulting address would prevent
the fault to be repeated).  Userland cannot skip repeating the fault even
more so if the fault was triggered by a KVM secondary page fault or any
get_user_pages or any copy-user inside some syscall which will return to
kernel code.  The second time FAULT_FLAG_RETRY_NOWAIT won't be set leading
to a SIGBUS being raised because the userfault can't wait if it cannot
release the mmap_map first (and FAULT_FLAG_RETRY_NOWAIT is required for
that).

Still returning userland a proper structure during the read() on the uffd,
can allow to use the current UFFD_API for the future non-cooperative
extensions too and it looks cleaner as well.  Once we get additional
fields there's no point to return the fault address page aligned anymore
to reuse the bits below PAGE_SHIFT.

The only downside is that the read() syscall will read 32bytes instead of
8bytes but that's not going to be measurable overhead.

The total number of new events that can be extended or of new future bits
for already shipped events, is limited to 64 by the features field of the
uffdio_api structure.  If more will be needed a bump of UFFD_API will be
required.

[akpm@linux-foundation.org: use __packed]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a9b85f9415fd9e529d03299e5335433f614ec1fb)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: Rename uffd_api.bits into .features
Pavel Emelyanov [Fri, 4 Sep 2015 22:46:34 +0000 (15:46 -0700)]
userfaultfd: Rename uffd_api.bits into .features

Orabug: 21685254

This is (seems to be) the minimal thing that is required to unblock
standard uffd usage from the non-cooperative one.  Now more bits can be
added to the features field indicating e.g.  UFFD_FEATURE_FORK and others
needed for the latter use-case.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3f602d2724b1f7d2d27ddcd7963a040a5890fd16)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: add new syscall to provide memory externalization
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:31 +0000 (15:46 -0700)]
userfaultfd: add new syscall to provide memory externalization

Orabug: 21685254

Once an userfaultfd has been created and certain region of the process
virtual address space have been registered into it, the thread responsible
for doing the memory externalization can manage the page faults in
userland by talking to the kernel using the userfaultfd protocol.

poll() can be used to know when there are new pending userfaults to be
read (POLLIN).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 86039bd3b4e6a1129318cbfed4e0a6e001656635)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:24 +0000 (15:46 -0700)]
userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx

Orabug: 21685254

vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
must be aware about so that we can merge vmas back like they were
originally before arming the userfaultfd on some memory range.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 19a809afe2fe089317226bbe5c5a1ce7f53dcdca)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: call handle_userfault() for userfaultfd_missing() faults
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:20 +0000 (15:46 -0700)]
userfaultfd: call handle_userfault() for userfaultfd_missing() faults

Orabug: 21685254

This is where the page faults must be modified to call
handle_userfault() if userfaultfd_missing() is true (so if the
vma->vm_flags had VM_UFFD_MISSING set).

handle_userfault() then takes care of blocking the page fault and
delivering it to userland.

The fault flags must also be passed as parameter so the "read|write"
kind of fault can be passed to userland.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6b251fc96cf2cdf1ce4b5db055547e2a5679bc77)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:17 +0000 (15:46 -0700)]
userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP

Orabug: 21685254

These two flags gets set in vma->vm_flags to tell the VM common code
if the userfaultfd is armed and in which mode (only tracking missing
faults, only tracking wrprotect faults or both). If neither flags is
set it means the userfaultfd is not armed on the vma.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 16ba6f811dfe44bc14f7946a4b257b85476fc16e)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: add vm_userfaultfd_ctx to the vm_area_struct
Mike Kravetz [Thu, 14 Jan 2016 23:52:25 +0000 (15:52 -0800)]
userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct

Orabug: 21685254

This adds the vm_userfaultfd_ctx to the vm_area_struct.

Oracle specific changes to maintain KABI also made.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 745f234be12b6191b15eae8dd415cc81a9137f47)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
  Conflicts:
include/linux/mm_types.h

8 years agouserfaultfd: linux/userfaultfd_k.h
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:10 +0000 (15:46 -0700)]
userfaultfd: linux/userfaultfd_k.h

Orabug: 21685254

Kernel header defining the methods needed by the VM common code to
interact with the userfaultfd.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 932b18e0aec65acb089f4bd8761ee85e70f8eb6a)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: uAPI
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:04 +0000 (15:46 -0700)]
userfaultfd: uAPI

Orabug: 21685254

Defines the uAPI of the userfaultfd, notably the ioctl numbers and protocol.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1038628d80e96e3a086189172d9be8eb85ecfabf)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agouserfaultfd: linux/Documentation/vm/userfaultfd.txt
Andrea Arcangeli [Fri, 4 Sep 2015 22:46:00 +0000 (15:46 -0700)]
userfaultfd: linux/Documentation/vm/userfaultfd.txt

Orabug: 21685254

This is the latest userfaultfd patchset.  The postcopy live migration
feature on the qemu side is mostly ready to be merged and it entirely
depends on the userfaultfd syscall to be merged as well.  So it'd be great
if this patchset could be reviewed for merging in -mm.

Userfaults allow to implement on demand paging from userland and more
generally they allow userland to more efficiently take control of the
behavior of page faults than what was available before (PROT_NONE +
SIGSEGV trap).

The use cases are:

1) KVM postcopy live migration (one form of cloud memory
   externalization).

   KVM postcopy live migration is the primary driver of this work:

    http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
    http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

2) postcopy live migration of binaries inside linux containers:

    http://thread.gmane.org/gmane.linux.kernel.mm/132662

3) KVM postcopy live snapshotting (allowing to limit/throttle the
   memory usage, unlike fork would, plus the avoidance of fork
   overhead in the first place).

   While the wrprotect tracking is not implemented yet, the syscall API is
   already contemplating the wrprotect fault tracking and it's generic enough
   to allow its later implementation in a backwards compatible fashion.

4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
   should be extended to work also on tmpfs and then the
   uffdio_register.ioctls will notify userland that UFFDIO_COPY is
   available even when the registered virtual memory range is tmpfs
   backed.

5) alternate mechanism to notify web browsers or apps on embedded
   devices that volatile pages have been reclaimed. This basically
   avoids the need to run a syscall before the app can access with the
   CPU the virtual regions marked volatile. This depends on point 4)
   to be fulfilled first, as volatile pages happily apply to tmpfs.

Even though there wasn't a real use case requesting it yet, it also
allows to implement distributed shared memory in a way that readonly
shared mappings can exist simultaneously in different hosts and they
can be become exclusive at the first wrprotect fault.

This patch (of 22):

Add documentation.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: zhang.zhanghailiang@huawei.com
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 25edd8bffd0f7563f0c04c1d219eb89061ce9886)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomm/hugetlbfs: unmap pages if page fault raced with hole punch
Mike Kravetz [Sat, 16 Jan 2016 00:57:40 +0000 (16:57 -0800)]
mm/hugetlbfs: unmap pages if page fault raced with hole punch

Orabug: 21685254

Page faults can race with fallocate hole punch.  If a page fault happens
between the unmap and remove operations, the page is not removed and
remains within the hole.  This is not the desired behavior.  The race is
difficult to detect in user level code as even in the non-race case, a
page within the hole could be faulted back in before fallocate returns.
If userfaultfd is expanded to support hugetlbfs in the future, this race
will be easier to observe.

If this race is detected and a page is mapped, the remove operation
(remove_inode_hugepages) will unmap the page before removing.  The unmap
within remove_inode_hugepages occurs with the hugetlb_fault_mutex held
so that no other faults will be processed until the page is removed.

The (unmodified) routine hugetlb_vmdelete_list was moved ahead of
remove_inode_hugepages to satisfy the new reference.

[akpm@linux-foundation.org: move hugetlb_vmdelete_list()]
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4aae8d1c051ea00b456da6811bc36d1f69de5445)
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agoxfs: validate metadata LSNs against log on v5 superblocks
Brian Foster [Mon, 12 Oct 2015 04:59:25 +0000 (15:59 +1100)]
xfs: validate metadata LSNs against log on v5 superblocks

From a45086e27dfa21a4b39134f7505c8f60a3ecdec4 Mon Sep 17 00:00:00 2001

Since the onset of v5 superblocks, the LSN of the last modification has
been included in a variety of on-disk data structures. This LSN is used
to provide log recovery ordering guarantees (e.g., to ensure an older
log recovery item is not replayed over a newer target data structure).

While this works correctly from the point a filesystem is formatted and
mounted, userspace tools have some problematic behaviors that defeat
this mechanism. For example, xfs_repair historically zeroes out the log
unconditionally (regardless of whether corruption is detected). If this
occurs, the LSN of the filesystem is reset and the log is now in a
problematic state with respect to on-disk metadata structures that might
have a larger LSN. Until either the log catches up to the highest
previously used metadata LSN or each affected data structure is modified
and written out without incident (which resets the metadata LSN), log
recovery is susceptible to filesystem corruption.

This problem is ultimately addressed and repaired in the associated
userspace tools. The kernel is still responsible to detect the problem
and notify the user that something is wrong. Check the superblock LSN at
mount time and fail the mount if it is invalid. From that point on,
trigger verifier failure on any metadata I/O where an invalid LSN is
detected. This results in a filesystem shutdown and guarantees that we
do not log metadata changes with invalid LSNs on disk. Since this is a
known issue with a known recovery path, present a warning to instruct
the user how to recover.

Orabug: 25062171

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoIB/ipoib: move back IB LL address into the hard header
Paolo Abeni [Thu, 13 Oct 2016 16:26:56 +0000 (18:26 +0200)]
IB/ipoib: move back IB LL address into the hard header

Orabug: 24469379

[ Backport of upstream commit fc791b633515 ]

After the commit 9207f9d45b0a ("net: preserve IP control block
during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
That destroy the IPoIB address information cached there,
causing a severe performance regression, as better described here:

http://marc.info/?l=linux-kernel&m=146787279825501&w=2

This change moves the data cached by the IPoIB driver from the
skb control lock into the IPoIB hard header, as done before
the commit 936d7de3d736 ("IPoIB: Stop lying about hard_header_len
and use skb->cb to stash LL addresses").
In order to avoid GRO issue, on packet reception, the IPoIB driver
stash into the skb a dummy pseudo header, so that the received
packets have actually a hard header matching the declared length.
To avoid changing the connected mode maximum mtu, the allocated
head buffer size is increased by the pseudo header length.

After this commit, IPoIB performances are back to pre-regression
value.

v2 -> v3: rebased
v1 -> v2: avoid changing the max mtu, increasing the head buf size

Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
8 years agonet: preserve IP control block during GSO segmentation
Konstantin Khlebnikov [Fri, 8 Jan 2016 12:21:46 +0000 (15:21 +0300)]
net: preserve IP control block during GSO segmentation

Orabug: 24469379

[ Upstream commit 9207f9d45b0ad071baa128e846d7e7ed85016df3 ]

Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.

This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit abefd1b4087b9b5e83e7b4e7689f8b8e3cb2899c)

Signed-off-by: Dan Duval <dan.duval@oracle.com>
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
8 years agoxfs: fix broken multi-fsb buffer logging
Brian Foster [Wed, 1 Jun 2016 07:38:12 +0000 (17:38 +1000)]
xfs: fix broken multi-fsb buffer logging

Orabug: 24400444
Upstream-commit: a3916e528b917851a4d2379e2fd2579ad5f2b5a7

Multi-block buffers are logged based on buffer offset in
xfs_trans_log_buf(). xfs_buf_item_log() ultimately walks each mapping in
the buffer and marks the associated range to be logged in the
xfs_buf_log_format bitmap for that mapping. This code is broken,
however, in that it marks the actual buffer offsets of the associated
range in each bitmap rather than shifting to the byte range for that
particular mapping.

For example, on a 4k fsb fs, buffer offset 4096 refers to the first byte
of the second mapping in the buffer. This means byte 0 of the second log
format bitmap should be tagged as dirty. Instead, the current code marks
byte offset 4096 of the second log format bitmap, which is invalid and
potentially out of range of the mapping.

As a result of this, the log item format code invoked at transaction
commit time is not be able to correctly identify what parts of the
buffer to copy into log vectors. This can lead to NULL log vector
pointer dereferences in CIL push context if the item format code was not
able to locate any dirty ranges at all. This crash has been reproduced
on a 4k FSB filesystem using 16k directory blocks where an unlink
operation happened not to log anything in the first block of the
mapping. The logged offsets were all over 4k, marked as such in the
subsequent log format mappings, and thus left the transaction with an
xfs_log_item that is marked DIRTY but without any logged regions.

Further, even when the logged regions are marked correctly in the buffer
log format bitmaps, the format code doesn't copy the correct ranges of
the buffer into the log. This means that any logged region beyond the
first block of a multi-block buffer is subject to corruption after a
crash and log recovery sequence. This is due to a failure to convert the
mapping bm_len field from basic blocks to bytes in the buffer offset
tracking code in xfs_buf_item_format().

Update xfs_buf_item_log() to convert buffer offsets to segment relative
offsets when logging multi-block buffers. This ensures that the modified
regions of a buffer are logged correctly and avoids the aforementioned
crash. Also update xfs_buf_item_format() to correctly track the source
offset into the buffer for the log vector formatting code. This ensures
that the correct data is copied into the log.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agoxfs: Split default quota limits by quota type
Carlos Maiolino [Mon, 8 Feb 2016 00:27:55 +0000 (11:27 +1100)]
xfs: Split default quota limits by quota type

Orabug: 24399524
Upstream-commit: be6079461abf796e29d02b450a16908f4bf58f6c

Default quotas are globally set due historical reasons. IRIX only
supported user and project quotas, and default quota was only
applied to user quotas.

In Linux, when a default quota is set, all different quota types
inherits the same default value.

An user with a quota limit larger than the default quota value, will
still be limited to the default value because the group quotas also
inherits the default quotas. Unless the group which the user belongs
to have a custom quota limit set.

This patch aims to split the default quota value by quota type.
Allowing each quota type having different default values.

Default time limits are still set globally. XFS does not set a
per-user/group timer, but a single global timer. For changing this
behavior, some changes should be made in user-space tools another
bugs being fixed.

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
8 years agopacket: fix race condition in packet_set_ring
Philip Pettersson [Wed, 30 Nov 2016 22:55:36 +0000 (14:55 -0800)]
packet: fix race condition in packet_set_ring

When packet_set_ring creates a ring buffer it will initialize a
struct timer_list if the packet version is TPACKET_V3. This value
can then be raced by a different thread calling setsockopt to
set the version to TPACKET_V1 before packet_set_ring has finished.

This leads to a use-after-free on a function pointer in the
struct timer_list when the socket is closed as the previously
initialized timer will not be deleted.

The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
changing the packet version while also taking the lock at the start
of packet_set_ring.

Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 84ac7260236a49c79eede91617700174c2c19b0c)

Orabug: 25209594
CVE: CVE-2016-8655
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agonet: avoid signed overflows for SO_{SND|RCV}BUFFORCE
Eric Dumazet [Fri, 2 Dec 2016 17:44:53 +0000 (09:44 -0800)]
net: avoid signed overflows for SO_{SND|RCV}BUFFORCE

CAP_NET_ADMIN users should not be allowed to set negative
sk_sndbuf or sk_rcvbuf values, as it can lead to various memory
corruptions, crashes, OOM...

Note that before commit 82981930125a ("net: cleanups in
sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF
and SO_RCVBUF were vulnerable.

This needs to be backported to all known linux kernels.

Again, many thanks to syzkaller team for discovering this gem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b98b0bc8c431e3ceb4b26b0dfc8db509518fb290)

Orabug: 25203090
CVE: CVE-2016-9793
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agoALSA: pcm : Call kill_fasync() in stream lock
Takashi Iwai [Mon, 5 Dec 2016 19:16:37 +0000 (11:16 -0800)]
ALSA: pcm : Call kill_fasync() in stream lock

Currently kill_fasync() is called outside the stream lock in
snd_pcm_period_elapsed().  This is potentially racy, since the stream
may get released even during the irq handler is running.  Although
snd_pcm_release_substream() calls snd_pcm_drop(), this doesn't
guarantee that the irq handler finishes, thus the kill_fasync() call
outside the stream spin lock may be invoked after the substream is
detached, as recently reported by KASAN.

As a quick workaround, move kill_fasync() call inside the stream
lock.  The fasync is rarely used interface, so this shouldn't have a
big impact from the performance POV.

Ideally, we should implement some sync mechanism for the proper finish
of stream and irq handler.  But this oneliner should suffice for most
cases, so far.

Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
(cherry picked from commit 3aa02cb664c5fb1042958c8d1aa8c35055a2ebc4)

Conflicts:

sound/core/pcm_lib.c

Orabug: 25203165
CVE: CVE-2016-9794
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agonetlink: Fix dump skb leak/double free
Herbert Xu [Mon, 16 May 2016 09:28:16 +0000 (17:28 +0800)]
netlink: Fix dump skb leak/double free

When we free cb->skb after a dump, we do it after releasing the
lock.  This means that a new dump could have started in the time
being and we'll end up freeing their skb instead of ours.

This patch saves the skb and module before we unlock so we free
the right memory.

Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 92964c79b357efd980812c4de5c1fd2ec8bb5520)

Orabug: 25203221
CVE: CVE-2016-9806
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agorcu: Fix soft lockup for rcu_nocb_kthread
Ding Tianhong [Wed, 15 Jun 2016 07:27:36 +0000 (15:27 +0800)]
rcu: Fix soft lockup for rcu_nocb_kthread

Carrying out the following steps results in a softlockup in the
RCU callback-offload (rcuo) kthreads:

1. Connect to ixgbevf, and set the speed to 10Gb/s.
2. Use ifconfig to bring the nic up and down repeatedly.

[  317.005148] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[  368.106005] BUG: soft lockup - CPU#1 stuck for 22s! [rcuos/1:15]
[  368.106005] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  368.106005] task: ffff88057dd8a220 ti: ffff88057dd9c000 task.ti: ffff88057dd9c000
[  368.106005] RIP: 0010:[<ffffffff81579e04>]  [<ffffffff81579e04>] fib_table_lookup+0x14/0x390
[  368.106005] RSP: 0018:ffff88061fc83ce8  EFLAGS: 00000286
[  368.106005] RAX: 0000000000000001 RBX: 00000000020155c0 RCX: 0000000000000001
[  368.106005] RDX: ffff88061fc83d50 RSI: ffff88061fc83d70 RDI: ffff880036d11a00
[  368.106005] RBP: ffff88061fc83d08 R08: 0000000000000001 R09: 0000000000000000
[  368.106005] R10: ffff880036d11a00 R11: ffffffff819e0900 R12: ffff88061fc83c58
[  368.106005] R13: ffffffff816154dd R14: ffff88061fc83d08 R15: 00000000020155c0
[  368.106005] FS:  0000000000000000(0000) GS:ffff88061fc80000(0000) knlGS:0000000000000000
[  368.106005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  368.106005] CR2: 00007f8c2aee9c40 CR3: 000000057b222000 CR4: 00000000000407e0
[  368.106005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  368.106005] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  368.106005] Stack:
[  368.106005]  00000000010000c0 ffff88057b766000 ffff8802e380b000 ffff88057af03e00
[  368.106005]  ffff88061fc83dc0 ffffffff815349a6 ffff88061fc83d40 ffffffff814ee146
[  368.106005]  ffff8802e380af00 00000000e380af00 ffffffff819e0900 020155c0010000c0
[  368.106005] Call Trace:
[  368.106005]  <IRQ>
[  368.106005]
[  368.106005]  [<ffffffff815349a6>] ip_route_input_noref+0x516/0xbd0
[  368.106005]  [<ffffffff814ee146>] ? skb_release_data+0xd6/0x110
[  368.106005]  [<ffffffff814ee20a>] ? kfree_skb+0x3a/0xa0
[  368.106005]  [<ffffffff8153698f>] ip_rcv_finish+0x29f/0x350
[  368.106005]  [<ffffffff81537034>] ip_rcv+0x234/0x380
[  368.106005]  [<ffffffff814fd656>] __netif_receive_skb_core+0x676/0x870
[  368.106005]  [<ffffffff814fd868>] __netif_receive_skb+0x18/0x60
[  368.106005]  [<ffffffff814fe4de>] process_backlog+0xae/0x180
[  368.106005]  [<ffffffff814fdcb2>] net_rx_action+0x152/0x240
[  368.106005]  [<ffffffff81077b3f>] __do_softirq+0xef/0x280
[  368.106005]  [<ffffffff8161619c>] call_softirq+0x1c/0x30
[  368.106005]  <EOI>
[  368.106005]
[  368.106005]  [<ffffffff81015d95>] do_softirq+0x65/0xa0
[  368.106005]  [<ffffffff81077174>] local_bh_enable+0x94/0xa0
[  368.106005]  [<ffffffff81114922>] rcu_nocb_kthread+0x232/0x370
[  368.106005]  [<ffffffff81098250>] ? wake_up_bit+0x30/0x30
[  368.106005]  [<ffffffff811146f0>] ? rcu_start_gp+0x40/0x40
[  368.106005]  [<ffffffff8109728f>] kthread+0xcf/0xe0
[  368.106005]  [<ffffffff810971c0>] ? kthread_create_on_node+0x140/0x140
[  368.106005]  [<ffffffff816147d8>] ret_from_fork+0x58/0x90
[  368.106005]  [<ffffffff810971c0>] ? kthread_create_on_node+0x140/0x140

==================================cut here==============================

It turns out that the rcuos callback-offload kthread is busy processing
a very large quantity of RCU callbacks, and it is not reliquishing the
CPU while doing so.  This commit therefore adds an cond_resched_rcu_qs()
within the loop to allow other tasks to run.

Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
[ paulmck: Substituted cond_resched_rcu_qs for cond_resched. ]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
(cherry picked from commit bedc1969150d480c462cdac320fa944b694a7162)

Orabug: 25165170
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agompi: Fix NULL ptr dereference in mpi_powm() [ver #3]
Andrey Ryabinin [Thu, 24 Nov 2016 13:23:10 +0000 (13:23 +0000)]
mpi: Fix NULL ptr dereference in mpi_powm() [ver #3]

This fixes CVE-2016-8650.

If mpi_powm() is given a zero exponent, it wants to immediately return
either 1 or 0, depending on the modulus.  However, if the result was
initalised with zero limb space, no limbs space is allocated and a
NULL-pointer exception ensues.

Fix this by allocating a minimal amount of limb space for the result when
the 0-exponent case when the result is 1 and not touching the limb space
when the result is 0.

This affects the use of RSA keys and X.509 certificates that carry them.

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6
PGD 0
Oops: 0002 [#1] SMP
Modules linked in:
CPU: 3 PID: 3014 Comm: keyctl Not tainted 4.9.0-rc6-fscache+ #278
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
task: ffff8804011944c0 task.stack: ffff880401294000
RIP: 0010:[<ffffffff8138ce5d>]  [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6
RSP: 0018:ffff880401297ad8  EFLAGS: 00010212
RAX: 0000000000000000 RBX: ffff88040868bec0 RCX: ffff88040868bba0
RDX: ffff88040868b260 RSI: ffff88040868bec0 RDI: ffff88040868bee0
RBP: ffff880401297ba8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000047 R11: ffffffff8183b210 R12: 0000000000000000
R13: ffff8804087c7600 R14: 000000000000001f R15: ffff880401297c50
FS:  00007f7a7918c700(0000) GS:ffff88041fb80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000401250000 CR4: 00000000001406e0
Stack:
 ffff88040868bec0 0000000000000020 ffff880401297b00 ffffffff81376cd4
 0000000000000100 ffff880401297b10 ffffffff81376d12 ffff880401297b30
 ffffffff81376f37 0000000000000100 0000000000000000 ffff880401297ba8
Call Trace:
 [<ffffffff81376cd4>] ? __sg_page_iter_next+0x43/0x66
 [<ffffffff81376d12>] ? sg_miter_get_next_page+0x1b/0x5d
 [<ffffffff81376f37>] ? sg_miter_next+0x17/0xbd
 [<ffffffff8138ba3a>] ? mpi_read_raw_from_sgl+0xf2/0x146
 [<ffffffff8132a95c>] rsa_verify+0x9d/0xee
 [<ffffffff8132acca>] ? pkcs1pad_sg_set_buf+0x2e/0xbb
 [<ffffffff8132af40>] pkcs1pad_verify+0xc0/0xe1
 [<ffffffff8133cb5e>] public_key_verify_signature+0x1b0/0x228
 [<ffffffff8133d974>] x509_check_for_self_signed+0xa1/0xc4
 [<ffffffff8133cdde>] x509_cert_parse+0x167/0x1a1
 [<ffffffff8133d609>] x509_key_preparse+0x21/0x1a1
 [<ffffffff8133c3d7>] asymmetric_key_preparse+0x34/0x61
 [<ffffffff812fc9f3>] key_create_or_update+0x145/0x399
 [<ffffffff812fe227>] SyS_add_key+0x154/0x19e
 [<ffffffff81001c2b>] do_syscall_64+0x80/0x191
 [<ffffffff816825e4>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 56 41 55 41 54 53 48 81 ec a8 00 00 00 44 8b 71 04 8b 42 04 4c 8b 67 18 45 85 f6 89 45 80 0f 84 b4 06 00 00 85 c0 75 2f 41 ff ce <49> c7 04 24 01 00 00 00 b0 01 75 0b 48 8b 41 18 48 83 38 01 0f
RIP  [<ffffffff8138ce5d>] mpi_powm+0x32/0x7e6
 RSP <ffff880401297ad8>
CR2: 0000000000000000
---[ end trace d82015255d4a5d8d ]---

Basically, this is a backport of a libgcrypt patch:

http://git.gnupg.org/cgi-bin/gitweb.cgi?p=libgcrypt.git;a=patch;h=6e1adb05d290aeeb1c230c763970695f4a538526

Fixes: cdec9cb5167a ("crypto: GnuPG based MPI lib - source files (part 1)")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com>
cc: linux-ima-devel@lists.sourceforge.net
cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
(cherry picked from commit f5527fffff3f002b0a6b376163613b82f69de073)

Orabug: 25151496
CVE: CVE-2016-8650
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agosctp: validate chunk len before actually using it
Marcelo Ricardo Leitner [Tue, 25 Oct 2016 16:27:39 +0000 (14:27 -0200)]
sctp: validate chunk len before actually using it

Andrey Konovalov reported that KASAN detected that SCTP was using a slab
beyond the boundaries. It was caused because when handling out of the
blue packets in function sctp_sf_ootb() it was checking the chunk len
only after already processing the first chunk, validating only for the
2nd and subsequent ones.

The fix is to just move the check upwards so it's also validated for the
1st chunk.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit bf911e985d6bbaa328c20c3e05f4eb03de11fdd6)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
 net/sctp/sm_statefuns.c

Orabug: 25142846
CVE: CVE-2016-9555
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agokvm: raise KVM_SOFT_MAX_VCPUS to support more vcpus
Dan Duval [Wed, 17 Jun 2015 17:09:26 +0000 (13:09 -0400)]
kvm: raise KVM_SOFT_MAX_VCPUS to support more vcpus

Orabug: 24820591

Despite the name, the kernel's manifest constant KVM_SOFT_MAX_VCPUS
defines the hard maximum number of vcpus that libvirt will support.
(When "--vcpus <n>" is specified on the virt-install command line,
the utility queries the kernel via ioctl(..., KVM_CAP_NR_VCPUS)
and uses the returned value as its limit.  Right now, the ioctl
implementation simply returns KVM_SOFT_MAX_VCPUS.)

Signed-off-by: Dan Duval <dan.duval@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
(cherry picked from commit dc6eb6241ff41cd2e5e73bc6ebd71a754fb5333c)
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
arch/x86/include/asm/kvm_host.h

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agonetfilter: nfnetlink: fix splat due to incorrect socket memory accounting in skbuff...
Pablo Neira Ayuso [Wed, 9 Dec 2015 11:09:56 +0000 (12:09 +0100)]
netfilter: nfnetlink: fix splat due to incorrect socket memory accounting in skbuff clones

If we attach the sk to the skb from nfnetlink_rcv_batch(), then
netlink_skb_destructor() will underflow the socket receive memory
counter and we get warning splat when releasing the socket.

$ cat /proc/net/netlink
sk       Eth Pid    Groups   Rmem     Wmem     Dump     Locks     Drops     Inode
ffff8800ca903000 12  0      00000000 -54144   0        0 2        0        17942
                                     ^^^^^^

Rmem above shows an underflow.

And here below the warning splat:

[ 1363.815976] WARNING: CPU: 2 PID: 1356 at net/netlink/af_netlink.c:958 netlink_sock_destruct+0x80/0xb9()
[...]
[ 1363.816152] CPU: 2 PID: 1356 Comm: kworker/u16:1 Tainted: G        W       4.4.0-rc1+ #153
[ 1363.816155] Hardware name: LENOVO 23259H1/23259H1, BIOS G2ET32WW (1.12 ) 05/30/2012
[ 1363.816160] Workqueue: netns cleanup_net
[ 1363.816163]  0000000000000000 ffff880119203dd0 ffffffff81240204 0000000000000000
[ 1363.816169]  ffff880119203e08 ffffffff8104db4b ffffffff813d49a1 ffff8800ca771000
[ 1363.816174]  ffffffff81a42b00 0000000000000000 ffff8800c0afe1e0 ffff880119203e18
[ 1363.816179] Call Trace:
[ 1363.816181]  <IRQ>  [<ffffffff81240204>] dump_stack+0x4e/0x79
[ 1363.816193]  [<ffffffff8104db4b>] warn_slowpath_common+0x9a/0xb3
[ 1363.816197]  [<ffffffff813d49a1>] ? netlink_sock_destruct+0x80/0xb9

skb->sk was only needed to lookup for the netns, however we don't need
this anymore since 633c9a840d0b ("netfilter: nfnetlink: avoid recurrent
netns lookups in call_batch") so this patch removes this manual socket
assignment to resolve this problem.

Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
Reported-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
(cherry picked from commit bd678e09dc1797bc0e2e536b6b268e7cf46e0184)

Orabug: 24749203
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agonetfilter: nfnetlink: avoid recurrent netns lookups in call_batch
Pablo Neira Ayuso [Wed, 9 Dec 2015 11:08:26 +0000 (12:08 +0100)]
netfilter: nfnetlink: avoid recurrent netns lookups in call_batch

Pass the net pointer to the call_batch callback functions so we can skip
recurrent lookups.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
(cherry picked from commit 633c9a840d0bf1cce690f3165bdacd8ab412949e)

Orabug: 24749203
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agonetfilter: nf_tables: fix wrong destroy anonymous sets if binding fails
Liping Zhang [Sat, 11 Jun 2016 04:20:28 +0000 (12:20 +0800)]
netfilter: nf_tables: fix wrong destroy anonymous sets if binding fails

When we add a nft rule like follows:
  # nft add rule filter test tcp dport vmap {1: jump test}
-ELOOP error will be returned, and the anonymous set will be
destroyed.

But after that, nf_tables_abort will also try to remove the
element and destroy the set, which was already destroyed and
freed.

If we add a nft wrong rule, nft_tables_abort will do the cleanup
work rightly, so nf_tables_set_destroy call here is redundant and
wrong, remove it.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit a02f424863610a0a7abd80c468839e59cfa4d0d8)

Oragbug: 24749203
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agonetfilter: nf_tables: use reverse traversal commit_list in nf_tables_abort
Xin Long [Mon, 7 Dec 2015 10:48:07 +0000 (18:48 +0800)]
netfilter: nf_tables: use reverse traversal commit_list in nf_tables_abort

When we use 'nft -f' to submit rules, it will build multiple rules into
one netlink skb to send to kernel, kernel will process them one by one.
meanwhile, it add the trans into commit_list to record every commit.
if one of them's return value is -EAGAIN, status |= NFNL_BATCH_REPLAY
will be marked. after all the process is done. it will roll back all the
commits.

now kernel use list_add_tail to add trans to commit, and use
list_for_each_entry_safe to roll back. which means the order of adding
and rollback is the same. that will cause some cases cannot work well,
even trigger call trace, like:

1. add a set into table foo  [return -EAGAIN]:
   commit_list = 'add set trans'
2. del foo:
   commit_list = 'add set trans' -> 'del set trans' -> 'del tab trans'
then nf_tables_abort will be called to roll back:
firstly process 'add set trans':
                   case NFT_MSG_NEWSET:
                        trans->ctx.table->use--;
                        list_del_rcu(&nft_trans_set(trans)->list);

  it will del the set from the table foo, but it has removed when del
  table foo [step 2], then the kernel will panic.

the right order of rollback should be:
  'del tab trans' -> 'del set trans' -> 'add set trans'.
which is opposite with commit_list order.

so fix it by rolling back commits with reverse order in nf_tables_abort.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit a907e36d54e0ff836e55e04531be201bf6b4d8c8)

Orabug: 24749203
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agoixgbevf: Handle previously-freed msix_entries
Mark Rustad [Fri, 28 Oct 2016 17:46:39 +0000 (10:46 -0700)]
ixgbevf: Handle previously-freed msix_entries

Orabug: 24568240

The msix_entries memory can be freed by a previous suspend or
remove, so don't crash on close when it isn't there. Also only
clear the interrupts when the interface is up, because there
aren't any when it is not up.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit eeffceee421b17a1d484679d738f278fbaa01384)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agoPCI: pciehp: Prioritize data-link event over presence detect
Ashok Raj [Sat, 19 Nov 2016 08:32:45 +0000 (00:32 -0800)]
PCI: pciehp: Prioritize data-link event over presence detect

If Slot Status indicates changes in both Data Link Layer Status and
Presence Detect, prioritize the Link status change.

When both events are observed, pciehp currently relies on the Slot Status
Presence Detect State (PDS) to agree with the Link Status Data Link Layer
Active status.  The Presence Detect State, however, may be set to 1 through
out-of-band presence detect even if the link is down, which creates
conflicting events.

Since the Link Status accurately reflects the reachability of the
downstream bus, the Link Status event should take precedence over a
Presence Detect event.  Skip checking the PDC status if we handled a link
event in the same handler.

Orabug: 25312751

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Altered slightly for UEK4.1
(cherry picked from commit 385895fef6b5f4723e33d0e58251c45bc708132d)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agoPCI: pciehp: Leave power indicator on when enabling already-enabled slot
Ashok Raj [Sat, 19 Nov 2016 08:32:46 +0000 (00:32 -0800)]
PCI: pciehp: Leave power indicator on when enabling already-enabled slot

If an error occurs when enabling a slot, pciehp_power_thread() turns off
the power indicator.  But if the only error is that the slot was already
enabled, we should leave the power indicator on.

Return success if called to enable an already-enabled slot.
This is in the same spirit of the special handling for EEXISTS when
pciehp_configure_device() determines the slot devices already exist.

Orabug: 25312751

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
(cherry picked from commit c4ae2adedb38240be5a1a16588406980b948a3e7)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agonet: Fix use after free in the recvmmsg exit path
Arnaldo Carvalho de Melo [Mon, 14 Mar 2016 12:56:35 +0000 (09:56 -0300)]
net: Fix use after free in the recvmmsg exit path

Orabug: 25298601
CVE: CVE-2016-7117

The syzkaller fuzzer hit the following use-after-free:

  Call Trace:
   [<ffffffff8175ea0e>] __asan_report_load8_noabort+0x3e/0x40 mm/kasan/report.c:295
   [<ffffffff851cc31a>] __sys_recvmmsg+0x6fa/0x7f0 net/socket.c:2261
   [<     inline     >] SYSC_recvmmsg net/socket.c:2281
   [<ffffffff851cc57f>] SyS_recvmmsg+0x16f/0x180 net/socket.c:2270
   [<ffffffff86332bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
  arch/x86/entry/entry_64.S:185

And, as Dmitry rightly assessed, that is because we can drop the
reference and then touch it when the underlying recvmsg calls return
some packets and then hit an error, which will make recvmmsg to set
sock->sk->sk_err, oops, fix it.

Reported-and-Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Fixes: a2e2725541fa ("net: Introduce recvmmsg socket syscall")
http://lkml.kernel.org/r/20160122211644.GC2470@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 34b88a68f26a75e4fded796f1a49c40f82234b7d)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agosignals: avoid unnecessary taking of sighand->siglock
Waiman Long [Tue, 13 Dec 2016 01:52:45 +0000 (12:52 +1100)]
signals: avoid unnecessary taking of sighand->siglock

When running certain database workload on a high-end system with many
CPUs, it was found that spinlock contention in the sigprocmask syscalls
became a significant portion of the overall CPU cycles as shown below.

  9.30%  9.30%  905387  dataserver  /proc/kcore 0x7fff8163f4d2
  [k] _raw_spin_lock_irq
            |
            ---_raw_spin_lock_irq
               |
               |--99.34%-- __set_current_blocked
               |          sigprocmask
               |          sys_rt_sigprocmask
               |          system_call_fastpath
               |          |
               |          |--50.63%-- __swapcontext
               |          |          |
               |          |          |--99.91%-- upsleepgeneric
               |          |
               |          |--49.36%-- __setcontext
               |          |          ktskRun

Looking further into the swapcontext function in glibc, it was found that
the function always call sigprocmask() without checking if there are
changes in the signal mask.

A check was added to the __set_current_blocked() function to avoid taking
the sighand->siglock spinlock if there is no change in the signal mask.
This will prevent unneeded spinlock contention when many threads are
trying to call sigprocmask().

With this patch applied, the spinlock contention in sigprocmask() was
gone.

Link: http://lkml.kernel.org/r/1474979209-11867-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stas Sergeev <stsp@list.ru>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c7be96af89d4b53211862d8599b2430e8900ed92)

Orabug: 25255325
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agoaudit: fix a double fetch in audit_log_single_execve_arg()
Paul Moore [Tue, 19 Jul 2016 21:42:57 +0000 (17:42 -0400)]
audit: fix a double fetch in audit_log_single_execve_arg()

Orabug: 25059936
CVE: CVE-2016-6136

There is a double fetch problem in audit_log_single_execve_arg()
where we first check the execve(2) argumnets for any "bad" characters
which would require hex encoding and then re-fetch the arguments for
logging in the audit record[1].  Of course this leaves a window of
opportunity for an unsavory application to munge with the data.

This patch reworks things by only fetching the argument data once[2]
into a buffer where it is scanned and logged into the audit
records(s).  In addition to fixing the double fetch, this patch
improves on the original code in a few other ways: better handling
of large arguments which require encoding, stricter record length
checking, and some performance improvements (completely unverified,
but we got rid of some strlen() calls, that's got to be a good
thing).

As part of the development of this patch, I've also created a basic
regression test for the audit-testsuite, the test can be tracked on
GitHub at the following link:

 * https://github.com/linux-audit/audit-testsuite/issues/25

[1] If you pay careful attention, there is actually a triple fetch
problem due to a strnlen_user() call at the top of the function.

[2] This is a tiny white lie, we do make a call to strnlen_user()
prior to fetching the argument data.  I don't like it, but due to the
way the audit record is structured we really have no choice unless we
copy the entire argument at once (which would require a rather
wasteful allocation).  The good news is that with this patch the
kernel no longer relies on this strnlen_user() value for anything
beyond recording it in the log, we also update it with a trustworthy
value whenever possible.

Reported-by: Pengfei Wang <wpengfeinudt@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 43761473c254b45883a64441dd0bc85a42f3645c)

Conflict:

kernel/auditsc.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agoKEYS: Fix short sprintf buffer in /proc/keys show function
David Howells [Wed, 26 Oct 2016 14:01:54 +0000 (15:01 +0100)]
KEYS: Fix short sprintf buffer in /proc/keys show function

Orabug: 25306361
CVE: CVE-2016-7042

Fix a short sprintf buffer in proc_keys_show().  If the gcc stack protector
is turned on, this can cause a panic due to stack corruption.

The problem is that xbuf[] is not big enough to hold a 64-bit timeout
rendered as weeks:

(gdb) p 0xffffffffffffffffULL/(60*60*24*7)
$2 = 30500568904943

That's 14 chars plus NUL, not 11 chars plus NUL.

Expand the buffer to 16 chars.

I think the unpatched code apparently works if the stack-protector is not
enabled because on a 32-bit machine the buffer won't be overflowed and on a
64-bit machine there's a 64-bit aligned pointer at one side and an int that
isn't checked again on the other side.

The panic incurred looks something like:

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffff81352ebe
CPU: 0 PID: 1692 Comm: reproducer Not tainted 4.7.2-201.fc24.x86_64 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 0000000000000086 00000000fbbd2679 ffff8800a044bc00 ffffffff813d941f
 ffffffff81a28d58 ffff8800a044bc98 ffff8800a044bc88 ffffffff811b2cb6
 ffff880000000010 ffff8800a044bc98 ffff8800a044bc30 00000000fbbd2679
Call Trace:
 [<ffffffff813d941f>] dump_stack+0x63/0x84
 [<ffffffff811b2cb6>] panic+0xde/0x22a
 [<ffffffff81352ebe>] ? proc_keys_show+0x3ce/0x3d0
 [<ffffffff8109f7f9>] __stack_chk_fail+0x19/0x30
 [<ffffffff81352ebe>] proc_keys_show+0x3ce/0x3d0
 [<ffffffff81350410>] ? key_validate+0x50/0x50
 [<ffffffff8134db30>] ? key_default_cmp+0x20/0x20
 [<ffffffff8126b31c>] seq_read+0x2cc/0x390
 [<ffffffff812b6b12>] proc_reg_read+0x42/0x70
 [<ffffffff81244fc7>] __vfs_read+0x37/0x150
 [<ffffffff81357020>] ? security_file_permission+0xa0/0xc0
 [<ffffffff81246156>] vfs_read+0x96/0x130
 [<ffffffff81247635>] SyS_read+0x55/0xc0
 [<ffffffff817eb872>] entry_SYSCALL_64_fastpath+0x1a/0xa4

Reported-by: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Ondrej Kozina <okozina@redhat.com>
cc: stable@vger.kernel.org
Signed-off-by: James Morris <james.l.morris@oracle.com>
(cherry picked from commit 03dab869b7b239c4e013ec82aea22e181e441cfc)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: Replace MSR_NHM_TURBO_RATIO_LIMIT
Srinivas Pandruvada [Wed, 6 Jul 2016 23:07:56 +0000 (16:07 -0700)]
tools/power turbostat: Replace MSR_NHM_TURBO_RATIO_LIMIT

Orabug: 24811361

Replace MSR_NHM_TURBO_RATIO_LIMIT with MSR_TURBO_RATIO_LIMIT.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ebf5926a001fd0d0d3fdfc2c39c0c8ff0c846c1a)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/turbostat: allow user to alter DESTDIR and PREFIX
Andy Shevchenko [Fri, 17 Jun 2016 12:59:33 +0000 (15:59 +0300)]
tools/turbostat: allow user to alter DESTDIR and PREFIX

Orabug: 24811361

When run
make -C tools DESTDIR=/my/nice/dir turbostat_install
get a message
install: cannot create regular file '/usr/bin/turbostat': Permission denied

Allow user to alter DESTDIR and PREFIX variables.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ac485cb4b3b79fb7ef979cb243ee16e1a1ffa767)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: work around RC6 counter wrap
Len Brown [Wed, 6 Apr 2016 21:16:00 +0000 (17:16 -0400)]
tools/power turbostat: work around RC6 counter wrap

Orabug: 24811361

Sometimes the rc6 sysfs counter spontaneously resets,
causing turbostat prints a very large number
as it tries to calcuate % = 100 * (old - new) / interval

When we see (old > new), print ***.**% instead
of a bogus huge number.

Note that this detection is not fool-proof, as the counter
could reset several times and still result in new > old.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 9185e988e9d5bb70b690362e84bb2e4a9d71f2c5)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: initial KBL support
Len Brown [Wed, 6 Apr 2016 21:15:59 +0000 (17:15 -0400)]
tools/power turbostat: initial KBL support

Orabug: 24811361

KBL is similar to SKL

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit cdc57272ea0a0e952c4609b56e157e4d0ec8e956)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: initial SKX support
Len Brown [Wed, 6 Apr 2016 21:15:58 +0000 (17:15 -0400)]
tools/power turbostat: initial SKX support

Orabug: 24811361

SKX has a lot in common with HSX

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ec53e594c65ab099ca784d62b6f4c191e3a4d7cc)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: decode BXT TSC frequency via CPUID
Len Brown [Wed, 6 Apr 2016 21:15:57 +0000 (17:15 -0400)]
tools/power turbostat: decode BXT TSC frequency via CPUID

Orabug: 24811361

Hard-code BXT ART to 19200MHz, so turbostat --debug
can fully enumerate TSC:

CPUID(0x15): eax_crystal: 3 ebx_tsc: 186 ecx_crystal_hz: 0
TSC: 1190 MHz (19200000 Hz * 186 / 3 / 1000000)

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit e8efbc80db5e824ce2382d5e65429b6b493e71e2)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: initial BXT support
Len Brown [Wed, 6 Apr 2016 21:15:56 +0000 (17:15 -0400)]
tools/power turbostat: initial BXT support

Orabug: 24811361

Broxton has a lot in common with SKL

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit e4085d543e256aff6606ba99ed257f7c06685f3b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: print IRTL MSRs
Len Brown [Fri, 23 Dec 2016 21:28:24 +0000 (16:28 -0500)]
tools/power turbostat: print IRTL MSRs

Orabug: 24811361

Some processors use the Interrupt Response Time Limit (IRTL) MSR value
to describe the maximum IRQ response time latency for deep
package C-states.  (Though others have the register, but do not use it)
Lets print it out to give insight into the cases where it is used.

IRTL begain in SNB, with PC3/PC6/PC7, and HSW added PC8/PC9/PC10.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 5a63426e2a18775ed05b20e3bc90c68bacb1f68a)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: SGX state should print only if --debug
Len Brown [Wed, 6 Apr 2016 21:15:54 +0000 (17:15 -0400)]
tools/power turbostat: SGX state should print only if --debug

Orabug: 24811361

The CPUID.SGX bit was printed, even if --debug was used

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 8ae7225591fd15aac89769cbebb3b5ecc8b12fe5)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: bugfix: TDP MSRs print bits fixing
Chen Yu [Sun, 13 Dec 2015 13:09:31 +0000 (21:09 +0800)]
tools/power turbostat: bugfix: TDP MSRs print bits fixing

Orabug: 24811361

MSR_CONFIG_TDP_NOMINAL:
should print all 8 bits of base_ratio (bit 0:7) 0xFF

MSR_CONFIG_TDP_LEVEL_1:
should print all 15 bits of PKG_MIN_PWR_LVL1 (bit 48:62) 0x7FFF
should print all 15 bits of PKG_MAX_PWR_LVL1 (bit 32:46) 0x7FFF
should print all 8 bits of LVL1_RATIO (bit 16:23) 0xFF
should print all 15 bits of PKG_TDP_LVL1 (bit 0:14) 0x7FFF

And the same modification to MSR_CONFIG_TDP_LEVEL_2.

MSR_TURBO_ACTIVATION_RATIO:
should print all 8 bits of MAX_NON_TURBO_RATIO (bit 0:7) 0xFF

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 685b535b2cdb9cdf354321f8af9ed17dcf19d19f)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
Len Brown [Sun, 13 Mar 2016 07:21:22 +0000 (03:21 -0400)]
tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump

Orabug: 24811361

MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e008008 (...pkg-cstate-limit=0: unlimited)
should print as
MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e008008 (...pkg-cstate-limit=8: unlimited)

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 6c34f160df82ae07bfff5b4c51f50622e9c2759e)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: call __cpuid() instead of __get_cpuid()
Len Brown [Sun, 13 Mar 2016 07:14:35 +0000 (03:14 -0400)]
tools/power turbostat: call __cpuid() instead of __get_cpuid()

Orabug: 24811361

turbostat already checks whether calling each cpuid leavf is legal,
and it doesn't look at the function return value,
so call the simpler gcc intrinsic __cpuid() instead of __get_cpuid().

syntax only, no functional change

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 5aea2f7f645b27635b856311dee5b775d277c686)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: indicate SMX and SGX support
Len Brown [Fri, 11 Mar 2016 18:26:03 +0000 (13:26 -0500)]
tools/power turbostat: indicate SMX and SGX support

Orabug: 24811361

SGX presence is related to a SKL power workaround,
so lets show when that is enabled.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit aa8d8cc79af16e16da04efff1c1a72b1ea4a9e7e)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: detect and work around syscall jitter
Len Brown [Sat, 27 Feb 2016 08:11:29 +0000 (03:11 -0500)]
tools/power turbostat: detect and work around syscall jitter

Orabug: 24811361

The accuracy of Bzy_Mhz and Busy% depend on reading
the TSC, APERF, and MPERF close together in time.

When there is a very short measurement interval,
or a large system is profoundly idle, the changes
in APERF and MPERF may be very small.
They can be small enough that an expensive interrupt
between reading APERF and MPERF can cause the APERF/MPERF
ratio to become inaccurate, resulting in invalid
calculation and display of Bzy_MHz.

A dummy APERF read of APERF makes this problem
much more rare.  Apparently this 1st systemn call
after exiting a long stretch of idle is when we
typically see expensive timer interrupts that cause
large jitter.

For the cases that dummy APERF read fails to prevent,
we compare the latency of the APERF and MPERF reads.
If they differ by more than 2x, we re-issue them.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 0102b06747c7d24e334d2b27c4b43eed693676f1)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: show GFX%rc6
Len Brown [Sat, 27 Feb 2016 06:28:12 +0000 (01:28 -0500)]
tools/power turbostat: show GFX%rc6

Orabug: 24811361

The column "GFX%c6" show the percentage of time the GPU
is in the "render C6" state, rc6.  Deep package C-states on several
systems depend on the GPU being in RC6.

This information comes from the counter
/sys/class/drm/card0/power/rc6_residency_ms,
as read before and after the measurement interval.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit fdf676e51f301d207586d9bac509b8ce055bae8a)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: show GFXMHz
Len Brown [Sat, 27 Feb 2016 05:37:54 +0000 (00:37 -0500)]
tools/power turbostat: show GFXMHz

Orabug: 24811361

Under the column "GFXMHz", show a snapshot of this attribute:
/sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz

This is an instantaneous snapshot of what sysfs presents
at the end of the measurement interval.  turbostat does
not average or otherwise perform any math on this value.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 27d47356b6dfa92042a17a0b474f08910d4c8e8f)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: show IRQs per CPU
Len Brown [Sat, 27 Feb 2016 04:48:05 +0000 (23:48 -0500)]
tools/power turbostat: show IRQs per CPU

Orabug: 24811361

The new IRQ column shows how many interrupts have occurred on each CPU
during the measurement inteval.  This information comes from
the difference between /proc/interrupts shapshots made before
and after the measurement interval.

The first row, the system summary, shows the sum of the IRQS
for all CPUs during that interval.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 562a2d377bb9882c49debc9e1be7127a1717e242)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: make fewer systems calls
Len Brown [Sat, 27 Feb 2016 01:51:02 +0000 (20:51 -0500)]
tools/power turbostat: make fewer systems calls

Orabug: 24811361

skip the open(2)/close(2) on each msr read
by keeping the /dev/cpu/*/msr files open.

The remaining read(2) is generally far fewer cycles
than the removed open(2) system call.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 36229897ba966bb0dc9e060222ff17b198252367)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: fix compiler warnings
Len Brown [Wed, 24 Feb 2016 23:16:40 +0000 (18:16 -0500)]
tools/power turbostat: fix compiler warnings

Orabug: 24811361

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 58cc30a4e6086d542302e04eea39b2a438e4c5dd)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
tools/power/x86/turbostat/turbostat.c

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: add --out option for saving output in a file
Len Brown [Sun, 14 Feb 2016 04:36:17 +0000 (23:36 -0500)]
tools/power turbostat: add --out option for saving output in a file

Orabug: 24811361

By default...

Turbostat --debug gconfiguration info goes to stderr.

In FORK mode, turbostat statistics go to stderr.

In PERIODIC mode, turbostat statistics go to stdout.

These defaults do not change, but an option "--out file"
will send all output above only to the specified file.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit b7d8c1483bbf6ec9d2dd76d6a1c91a38c3f6ac35)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: re-name "%Busy" field to "Busy%"
Len Brown [Sun, 14 Feb 2016 04:41:53 +0000 (23:41 -0500)]
tools/power turbostat: re-name "%Busy" field to "Busy%"

Orabug: 24811361

some tools processing turbostat output
have difficulty with items that begin with %...

Reported-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 75d2e44e60490ba1fee076a5f4dcfbdc8598e8c4)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
Hubert Chrzaniuk [Wed, 10 Feb 2016 13:55:22 +0000 (14:55 +0100)]
tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding

Orabug: 24811361

Following changes have been made:
- changed MSR_NHM_TURBO_RATIO_LIMIT to MSR_TURBO_RATIO_LIMIT in debug print
  for consistency with Developer Manual
- updated definition of bitfields in MSR_TURBO_RATIO_LIMIT and appropriate
  parsing code
- added x200 to list of architectures that do not support Nahlem compatible
  definition of MSR_TURBO_RATIO_LIMIT register (x200 has the register but
  bits definition is custom)
- fixed typo in code that parses MSR_TURBO_RATIO_LIMIT
  (logical instead of bitwise operator)
- changed MSR_TURBO_RATIO_LIMIT parsing algorithm so the print out had the
  same order as implementations for other platforms

Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit cbf97abaf3689652bcddc0741dc49629d1838142)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: Intel Xeon x200: fix erroneous bclk value
Chrzaniuk, Hubert [Wed, 10 Feb 2016 15:35:17 +0000 (16:35 +0100)]
tools/power turbostat: Intel Xeon x200: fix erroneous bclk value

Orabug: 24811361

x200 does not enable any way to programmatically obtain bus clock
speed. Bclk for the architecture has a fixed value of 100 MHz.
At the same time x200 cannot be included in has_snb_msrs since
it does not support C7 idle state.

prior to this patch, MHz values reported on this chip
were erroneously calculated using bclk of 133MHz,
causing MHz values to be reported 33% higher than actual.

Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 121b48bb77187cf2ed3053e147d2e6c1e864083c)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: allow sub-sec intervals
Len Brown [Sat, 13 Feb 2016 03:44:48 +0000 (22:44 -0500)]
tools/power turbostat: allow sub-sec intervals

Orabug: 24811361

turbostat -i interval_sec

will sample and display statistics every interval_sec.
interval_sec used to be a whole number of seconds,
but now we accept a decimal, as small as 0.001 sec (1 ms).

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 2a0609c02e6558df6075f258af98a54a74b050ff)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: fix various build warnings
Colin Ian King [Wed, 2 Mar 2016 13:50:25 +0000 (13:50 +0000)]
tools/power turbostat: fix various build warnings

Orabug: 24811361

When building with gcc 6 we're getting various build warnings that just
require some trivial function declaration and call fixes:

  turbostat.c: In function â€˜dump_cstate_pstate_config_info’:
  turbostat.c:1973:1: warning: type of â€˜family’ defaults to â€˜int’
   dump_cstate_pstate_config_info(family, model)
  turbostat.c:1973:1: warning: type of â€˜model’ defaults to â€˜int’
  turbostat.c: In function â€˜get_tdp’:
  turbostat.c:2145:8: warning: type of â€˜model’ defaults to â€˜int’
   double get_tdp(model)
  turbostat.c: In function â€˜perf_limit_reasons_probe’:
  turbostat.c:2259:6: warning: type of â€˜family’ defaults to â€˜int’
   void perf_limit_reasons_probe(family, model)
  turbostat.c:2259:6: warning: type of â€˜model’ defaults to â€˜int’

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-wbicer8n0s9qe6ql8h9x478e@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
(cherry picked from commit 1b69317d2dc80bc8e1d005e1a771c4f5bff3dabe)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: Decode MSR_MISC_PWR_MGMT
Len Brown [Thu, 3 Dec 2015 06:35:36 +0000 (01:35 -0500)]
tools/power turbostat: Decode MSR_MISC_PWR_MGMT

Orabug: 24811361

This MSR is helpful to show if P-state HW coordination
is enabled or disabled.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit f0057310b40efe9f797ff337e9464e6a6fb9d782)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: decode HWP registers
Len Brown [Tue, 1 Dec 2015 06:36:39 +0000 (01:36 -0500)]
tools/power turbostat: decode HWP registers

Orabug: 24811361

 # turbostat --debug
...
CPUID(6): ... HWP, HWPnotify, HWPwindow, HWPepp, HWPpkg ...
...
cpu0: MSR_PM_ENABLE: 0x00000001 (HWP)
cpu0: MSR_HWP_CAPABILITIES: 0x01050916 (high 0x16 guar 0x9 eff 0x5 low 0x1)
cpu0: MSR_HWP_REQUEST: 0x80001604 (min 0x4 max 0x16 des 0x0 epp 0x80 window 0x0 pkg 0x0)
cpu0: MSR_HWP_INTERRUPT: 0x00000001 (EN_Guaranteed_Perf_Change, Dis_Excursion_Min)
cpu0: MSR_HWP_STATUS: 0x00000000 (No-Guaranteed_Perf_Change, No-Excursion_Min)

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 7f5c258e1ce1e0909d3195694ac79f143051e513)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: CPUID(0x16) leaf shows base, max, and bus frequency
Len Brown [Mon, 23 Nov 2015 07:30:51 +0000 (02:30 -0500)]
tools/power turbostat: CPUID(0x16) leaf shows base, max, and bus frequency

Orabug: 24811361

This CPUID leaf is available on Skylake:

CPUID(0x16): base_mhz: 1500 max_mhz: 2200 bus_mhz: 100

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 61a87ba7893a256d86c7eea6a7ab10d38ccac9b2)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
tools/power/x86/turbostat/turbostat.c

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: decode more CPUID fields
Len Brown [Sat, 21 Nov 2015 17:22:47 +0000 (12:22 -0500)]
tools/power turbostat: decode more CPUID fields

Orabug: 24811361

for debugging, dump a few more fields:

CPUID(1): SSE3 MONITOR EIST TM2 TSC MSR ACPI-TM TM

cpu0: MSR_IA32_MISC_ENABLE: 0x00850089 (TCC EIST MONITOR)

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 69807a638f91524ed75027f808cd277417ecee7a)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: use new name for MSR_PLATFORM_INFO
Len Brown [Thu, 12 Nov 2015 07:42:31 +0000 (02:42 -0500)]
tools/power turbostat: use new name for MSR_PLATFORM_INFO

Orabug: 24811361

MSR_PLATFORM_INFO is the new name for MSR_NHM_PLATFORM_INFO

no functional change

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ec0adc539b8bf59b7c00db0748671f6594b77843)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: bugfix: print MAX_NON_TURBO_RATIO
Len Brown [Thu, 22 Oct 2015 06:42:12 +0000 (02:42 -0400)]
tools/power turbostat: bugfix: print MAX_NON_TURBO_RATIO

Orabug: 24811361

MSR_TURBO_ACTIVATION_RATIO: 0x00000016 (MAX_NON_TURBO_RATIO=6 lock=0)
should print all 7 bits of MAX_NON_TURBO_RATIO (in decimal):
MSR_TURBO_ACTIVATION_RATIO: 0x00000016 (MAX_NON_TURBO_RATIO=22 lock=0)

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 759d2a932b82009a7039ef5567e7dcba153ce123)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: simplify Bzy_MHz calculation
Len Brown [Tue, 20 Oct 2015 02:37:40 +0000 (22:37 -0400)]
tools/power turbostat: simplify Bzy_MHz calculation

Orabug: 24811361

    Bzy_MHz = TSC_delta*tsc_tweak/APERF_delta/MPERF_delta/measurement_interval

becomes

    Bzy_MHz = base_mhz/APERF_delta/MPERF_delta

on systems which support MSR_NHM_PLATFORM_INFO.

base_mhz is calculated directly from the base_ratio
reported in MSR_NHM_PLATFORM_INFO * bclk,
and bclk is discovered via MSR or cpuid.

This reduces the dependency of Bzy_MHz calculation on the TSC.
Previously, there were 4 TSC readings required in each caculation,
the raw TSC delta combined with the measurement_interval.
This also removes the "tsc_tweak" correction factor used when
TSC runs on a different base clock from the CPU's bclk.

After this change, tsc_tweak is used only for %Busy.

The end-result should be a Bzy_MHz result slightly less prone to jitter.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 21ed5574d1622118b49b0c6342acc8d27d0799be)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: SKL: Adjust for TSC difference from base frequency
Len Brown [Sat, 26 Sep 2015 04:12:38 +0000 (00:12 -0400)]
tools/power turbostat: SKL: Adjust for TSC difference from base frequency

Orabug: 24811361

On a Skylake with 1500MHz base frequency,
the TSC runs at 1512MHz.

This is because the TSC is no longer in the n*100 MHz BCLK domain,
but is now in the m*24MHz crystal clock domain. (24 MHz * 63 = 1512 MHz)

This adds error to several calculations in turbostat,
unless the TSC sample sizes are adjusted for this difference.

Note that calculations in the time domain are immune
from this issue, as the timing sub-system has already
calibrated the TSC against a known wall clock.

AVG_MHz = APERF_delta/measurement_interval

need no adjustment.  APERF_delta is in the BCLK domain,
and measurement_interval is in the time domain.

TSC_MHz  =  TSC_delta/measurement_interval

needs no adjustment -- as we really do want to report
the actual measured TSC delta here, and measurement_interval
is in the accurate time domain.

%Busy = MPERF_delta/TSC_delta

needs adjustment to use TSC_BCLK_DOMAIN_delta.
TSC_BCLK_DOMAIN_delta = TSC_delta * base_hz / tsc_hz

Bzy_MHz = TSC_delta/APERF_delta/MPERF_delta/measurement_interval

need adjustment as above.

No other metrics in turbostat need to be adjusted.

Before:

     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -     550   24.84    2216    1512
       0    2191   98.73    2219    1514
       2       0    0.01    2130    1512
       1       9    0.43    2016    1512
       3       2    0.08    2016    1512

After:

     CPU Avg_MHz   %Busy Bzy_MHz TSC_MHz
       -     550   25.05    2198    1512
       0    2190   99.62    2199    1512
       2       0    0.01    2152    1512
       1       9    0.46    2000    1512
       3       2    0.10    2000    1512

Note that in this example, the "Before" Bzy_MHz
was reported as exceeding the 2200 max turbo rate.
Also, even a pinned spin loop would not be reported
as over 99% busy.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit a2b7b74945dbfe5d734eafe8aa52f9f1f8bc6931)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: KNL workaround for %Busy and Avg_MHz
Hubert Chrzaniuk [Mon, 14 Sep 2015 11:31:00 +0000 (13:31 +0200)]
tools/power turbostat: KNL workaround for %Busy and Avg_MHz

Orabug: 24811361

KNL increments APERF and MPERF every 1024 clocks.
This is compliant with the architecture specification,
which requires that only the ratio of APERF/MPERF need be valid.

However, turbostat takes advantage of the fact that these
two MSRs increment every un-halted clock
at the actual and base frequency:

AVG_MHz = APERF_delta/measurement_interval

%Busy = MPERF_delta/TSC_delta

This quirk is needed for these calculations to also work on KNL,
which would otherwise show a value 1024x smaller than expected.

Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit b2b34dfe4d9aa4c468fc363b3b666974783ed1f9)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: IVB Xeon: fix --debug regression
Len Brown [Sat, 26 Sep 2015 01:12:39 +0000 (21:12 -0400)]
tools/power turbostat: IVB Xeon: fix --debug regression

Orabug: 24811361

Staring in Linux-4.3-rc1,
commit 6fb3143b561c ("tools/power turbostat: dump CONFIG_TDP")
touches MSR 0x648, which is not supported on IVB-Xeon.
This results in "turbostat --debug" exiting on those systems:

turbostat: /dev/cpu/2/msr offset 0x648 read failed: Input/output error

Remove IVB-Xeon from the list of machines supporting with that MSR.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 756357b8e4b072fd5ee86421f794e071a348802b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: fix typo on DRAM column in Joules-mode
Len Brown [Fri, 24 Jul 2015 14:35:23 +0000 (10:35 -0400)]
tools/power turbostat: fix typo on DRAM column in Joules-mode

< RAM_W
> RAM_J

Reported-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit bd6906ed3d7a00d55c9bd368a640ef83bb487d1d)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: fix parameter passing for forked command
Len Brown [Thu, 16 Jul 2015 01:49:41 +0000 (21:49 -0400)]
tools/power turbostat: fix parameter passing for forked command

Orabug: 24811361

turbostat supports forked command when sampling cpu state. However,
the forked command is not allowed to be executed with options, otherwise
turbostat might regard these options as invalid turbostat options.

For example:

./turbostat stress -c 4 -t 10
./turbostat: unrecognized option '-t'

Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit a01e72fbc41e322ed229465de8b595a7e3ec6549)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: dump CONFIG_TDP
Len Brown [Wed, 17 Jun 2015 20:23:45 +0000 (16:23 -0400)]
tools/power turbostat: dump CONFIG_TDP

Orabug: 24811361

Config TDP is a feature that allows parts to be configured
for different thermal limits after they have left the factory.

This can have an effect on the operation of the part,
particularly in determiniing...

Max Non-turbo Ratio
Turbo Activation Ratio

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 6fb3143b561c4a7865e5513eeb02d42ef38e8173)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: cpu0 is no longer hard-coded, so update output
Len Brown [Wed, 17 Jun 2015 16:27:21 +0000 (12:27 -0400)]
tools/power turbostat: cpu0 is no longer hard-coded, so update output

Orabug: 24811361

The --debug option reads a number of per-package MSRs.
Previously we explicitly read them on cpu0, but recently
turbostat changed to read them on the current "base_cpu".

Update the print-out to reflect base_cpu, rather than
the hard-coded cpu0.

Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit bfae2052265cde825afaba35eb3a4d3889432734)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agotools/power turbostat: update turbostat(8)
Len Brown [Wed, 3 Jun 2015 11:37:24 +0000 (07:37 -0400)]
tools/power turbostat: update turbostat(8)

Orabug: 24811361

Remove reference to the original Nehalem Turbo white paper,
since it has moved, and these mechanisms have now long since
been documented in the Software Developer's Manual.

Reported-by: Jeremie Lagraviere <jeremie@simula.no>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 75fd7ffa7fab91c2c3234bd3e465ba4f366733f4)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: validate ICV length on link creation
Davide Caratti [Fri, 22 Jul 2016 13:07:58 +0000 (15:07 +0200)]
macsec: validate ICV length on link creation

Test the cipher suite initialization in case ICV length has a value
different than its default. If this test fails, creation of a new macsec
link will also fail. This avoids situations where further security
associations can't be added due to failures of crypto_aead_setauthsize(),
caused by unsupported user-provided values of the ICV length.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f04c392d2dd97a985878f4380a1b054791301acf)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: fix error codes when a SA is created
Davide Caratti [Fri, 22 Jul 2016 13:07:57 +0000 (15:07 +0200)]
macsec: fix error codes when a SA is created

preserve the return value of AEAD functions that are called when a SA is
created, to avoid inappropriate display of "RTNETLINK answers: Cannot
allocate memory" message.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 34aedfee22967236adc3d9147c8b47b7f5bad26c)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: limit ICV length to 16 octets
Davide Caratti [Fri, 22 Jul 2016 13:07:56 +0000 (15:07 +0200)]
macsec: limit ICV length to 16 octets

IEEE 802.1AE-2006 standard recommends that the ICV element in a MACsec
frame should not exceed 16 octets: add MACSEC_STD_ICV_LEN in uapi
definitions accordingly, and avoid accepting configurations where the ICV
length exceeds the standard value. Leave definition of MACSEC_MAX_ICV_LEN
unchanged for backwards compatibility with userspace programs.

Fixes: dece8d2b78d1 ("uapi: add MACsec bits")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2ccbe2cb79f2f74ab739252299b6f9ff27586f2c)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agovlan: use a valid default mtu value for vlan over macsec
Paolo Abeni [Thu, 14 Jul 2016 16:00:10 +0000 (18:00 +0200)]
vlan: use a valid default mtu value for vlan over macsec

macsec can't cope with mtu frames which need vlan tag insertion, and
vlan device set the default mtu equal to the underlying dev's one.
By default vlan over macsec devices use invalid mtu, dropping
all the large packets.
This patch adds a netif helper to check if an upper vlan device
needs mtu reduction. The helper is used during vlan devices
initialization to set a valid default and during mtu updating to
forbid invalid, too bit, mtu values.
The helper currently only check if the lower dev is a macsec device,
if we get more users, we need to update only the helper (possibly
reserving an additional IFF bit).

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 18d3df3eab23796d7f852f9c6bb60962b8372ced)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: fix netlink attribute for key id
Sabrina Dubroca [Wed, 18 May 2016 11:34:40 +0000 (13:34 +0200)]
macsec: fix netlink attribute for key id

In my last commit I replaced MACSEC_SA_ATTR_KEYID by
MACSEC_SA_ATTR_KEY.

Fixes: 8acca6acebd0 ("macsec: key identifier is 128 bits, not 64")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1968a0b8b6ca088bc029bd99ee696f1aca4090d0)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: key identifier is 128 bits, not 64
Sabrina Dubroca [Sat, 7 May 2016 18:19:29 +0000 (20:19 +0200)]
macsec: key identifier is 128 bits, not 64

The MACsec standard mentions a key identifier for each key, but
doesn't specify anything about it, so I arbitrarily chose 64 bits.

IEEE 802.1X-2010 specifies MKA (MACsec Key Agreement), and defines the
key identifier to be 128 bits (96 bits "member identifier" + 32 bits
"key number").

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8acca6acebd07b238af2e61e4f7d55e6232c7e3a)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: set actual real device for xmit when !protect_frames
Daniel Borkmann [Thu, 30 Jun 2016 22:00:54 +0000 (00:00 +0200)]
macsec: set actual real device for xmit when !protect_frames

Avoid recursions of dev_queue_xmit() to the wrong net device when
frames are unprotected, since at that time skb->dev still points to
our own macsec dev and unlike macsec_encrypt_finish() dev pointer
doesn't get updated to real underlying device.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 79c62220d74a4a3f961a2cb7320da09eebf5daf7)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: fix SA initialization
Sabrina Dubroca [Tue, 14 Jun 2016 13:25:16 +0000 (15:25 +0200)]
macsec: fix SA initialization

The ASYNC flag prevents initialization on some physical machines.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 6052f7fbce857e327218a9d8a040e210ea7cc718)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: allocate sg and iv on the heap
Sabrina Dubroca [Tue, 14 Jun 2016 13:25:15 +0000 (15:25 +0200)]
macsec: allocate sg and iv on the heap

For the crypto callbacks to work properly, we cannot have sg and iv on
the stack.  Use kmalloc instead, with a single allocation for
aead_request + scatterlist + iv.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5d9649b3a524df9ccd15ac25ad6591089260fdbd)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: add rcu_barrier() on module exit
Sabrina Dubroca [Tue, 14 Jun 2016 13:25:14 +0000 (15:25 +0200)]
macsec: add rcu_barrier() on module exit

Without this, the various uses of call_rcu could cause a kernel panic.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b196c22af5c3ff784c472c80f6fb4e5fad67b2ac)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: Convert to using IFF_NO_QUEUE
Phil Sutter [Fri, 22 Apr 2016 12:02:42 +0000 (14:02 +0200)]
macsec: Convert to using IFF_NO_QUEUE

Signed-off-by: Phil Sutter <phil@nwl.cc>
Acked-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit e425974feaa545575135f04e646f0495439b4c54)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: fix netlink attribute validation
Sabrina Dubroca [Fri, 22 Apr 2016 09:28:09 +0000 (11:28 +0200)]
macsec: fix netlink attribute validation

macsec_validate_attr should check IFLA_MACSEC_REPLAY_PROTECT (not
IFLA_MACSEC_PROTECT) to verify that the replay protection and replay
window arguments are correct.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4b1fb9352f351faa067a914907d58a6fe38ac048)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: add missing macsec prefix in uapi
Sabrina Dubroca [Fri, 22 Apr 2016 09:28:08 +0000 (11:28 +0200)]
macsec: add missing macsec prefix in uapi

I accidentally forgot some MACSEC_ prefixes in if_macsec.h.

Fixes: dece8d2b78d1 ("uapi: add MACsec bits")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 748164802c1bd2c52937d20782b07d8c68dd9a4f)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: fix SA leak if initialization fails
Sabrina Dubroca [Fri, 22 Apr 2016 09:28:07 +0000 (11:28 +0200)]
macsec: fix SA leak if initialization fails

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Reported-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 38787fc209580f9b5918e93e71da7c960dbb5d8d)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: fix memory leaks around rx_handler (un)registration
Sabrina Dubroca [Fri, 22 Apr 2016 09:28:06 +0000 (11:28 +0200)]
macsec: fix memory leaks around rx_handler (un)registration

We leak a struct macsec_rxh_data when we unregister the rx_handler in
macsec_dellink.
We also leak a struct macsec_rxh_data in register_macsec_dev if we fail
to register the rx_handler.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 960d5848dbf1245cc3a310109897937207411c0c)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: add consistency check to netlink dumps
Sabrina Dubroca [Fri, 22 Apr 2016 09:28:05 +0000 (11:28 +0200)]
macsec: add consistency check to netlink dumps

Use genl_dump_check_consistent in dump_secy.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Suggested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 96cfc5052c5d434563873caa7707b32b9e389b16)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
8 years agomacsec: fix rx_sa refcounting with decrypt callback
Sabrina Dubroca [Fri, 22 Apr 2016 09:28:04 +0000 (11:28 +0200)]
macsec: fix rx_sa refcounting with decrypt callback

The decrypt callback macsec_decrypt_done needs a reference on the rx_sa
and releases it before returning, but macsec_handle_frame already
put that reference after macsec_decrypt returned NULL.

Set rx_sa to NULL when the decrypt callback runs so that
macsec_handle_frame knows it must not release the reference.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c3b7d0bd7ac2c501d4806db71ddd383c184968e8)

Orabug: 24614549

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>