Jacob Keller [Wed, 5 Apr 2017 11:50:53 +0000 (07:50 -0400)]
i40e: update error message when trying to add invalid filters
Re-word the error message displayed when adding a filter with an
invalid flow type. Additionally, report a distinct error message when
the IPv4 protocol is at fault.
Change-ID: Iba3d85b87f8d383c97c8bdd180df34a6adf3ee67 Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit a346fb836c712b43fc7bd925534eb8c23b3b61f0) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
Mitch Williams [Thu, 30 Mar 2017 07:46:08 +0000 (00:46 -0700)]
i40e: close client on remove and shutdown
When the driver is removed or shut down, close any attached clients
(i.e. i40iw). This prevents a panic seen sometimes on forced driver
removal or system shutdown when iWarp is running.
Change-ID: I4f6161e5a73ffbb2fd5883567b007310302bfcb5 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 921c467c6bf8f6fe5cd139b0535ad42b952330f0) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
Alexander Duyck [Tue, 14 Mar 2017 17:15:27 +0000 (10:15 -0700)]
i40e/i40evf: Change the way we limit the maximum frame size for Rx
This patch changes the way we handle the maximum frame size for the Rx
path. Previously we were rounding up to 2K for a 1500 MTU and then bringing
the max frame size down to MTU plus a fixed amount. With this patch
applied what we now do is limit the maximum frame to 1.5K minus the value
for NET_IP_ALIGN for standard MTU, and for any MTU greater than 1500 we
allow up to the maximum frame size. This makes the behavior more
consistent with the other drivers such as igb which had similar logic. In
addition it reduces the test matrix for MTU since we only have two max
frame sizes that are handled for Rx now.
Change-ID: I23a9d3c857e7df04b0ef28c64df63e659c013f3f Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit dab86afdbbd1bc5d5a89b67ed141d2f46c3b4191) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
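As a rough illustration of the Rx limit described in the commit above, the selection could be sketched in C as follows. All names are hypothetical, and the constant values (NET_IP_ALIGN taken as 2, and the 9728-byte maximum in particular) are assumptions made for this example rather than the driver's definitions.

```c
/* Sketch only: names and values are assumptions, not the i40e definitions. */
#define SKETCH_ETH_DATA_LEN   1500u  /* standard MTU */
#define SKETCH_NET_IP_ALIGN   2u     /* assumed; architecture dependent in the kernel */
#define SKETCH_RXBUFFER_1536  1536u  /* the "1.5K" buffer size */
#define SKETCH_MAX_RXBUFFER   9728u  /* hypothetical hardware maximum frame size */

/* Pick one of only two Rx max frame sizes, based on the MTU. */
static unsigned int sketch_rx_max_frame(unsigned int mtu)
{
	if (mtu <= SKETCH_ETH_DATA_LEN)
		return SKETCH_RXBUFFER_1536 - SKETCH_NET_IP_ALIGN;
	return SKETCH_MAX_RXBUFFER;
}
```

With only two possible results, MTU testing needs to cover just the standard case and the "anything larger" case.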
Alexander Duyck [Tue, 14 Mar 2017 17:15:26 +0000 (10:15 -0700)]
i40e/i40evf: Add legacy-rx private flag to allow fallback to old Rx flow
This patch adds a control which will allow us to toggle into and out of the
legacy Rx mode. The legacy Rx mode is what we currently do when performing
Rx. As I make further changes what should happen is that the driver will
fall back to the behavior for Rx as of this patch should the "legacy-rx"
flag be set to on.
Change-ID: I0342998849bbb31351cce05f6e182c99174e7751 Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit c424d4a3dd798958074bde7c1dcd8dc08962d820) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
Alexander Duyck [Tue, 14 Mar 2017 17:15:23 +0000 (10:15 -0700)]
i40e/i40evf: Pull code for grabbing and syncing rx_buffer from fetch_buffer
This patch pulls the code responsible for fetching the Rx buffer and
synchronizing DMA into a function, specifically called i40e_get_rx_buffer.
The general idea is to allow for better code reuse by pulling this out of
i40e_fetch_rx_buffer. We dropped a couple of prefetches since the time
between the prefetch being called and the data being accessed was too small
to be useful.
Change-ID: I4885fce4b2637dbedc8e16431169d23d3d7e79b9 Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit 9a064128fc8489e9066fde872f6fdeb3d1bbb84f) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
Alexander Duyck [Tue, 14 Mar 2017 17:15:22 +0000 (10:15 -0700)]
i40e/i40evf: Use length to determine if descriptor is done
This change makes it so that we use the length of the packet instead of the
DD status bit to determine if a new descriptor is ready to be processed.
The obvious advantage is that it cuts down on reads as we don't really even
need the DD bit if going from a 0 to a non-zero value on size is enough to
inform us that the packet has been completed.
Change-ID: Iebdf9cdb36c454ef092df27199b92ad09c374231 Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26785018
(cherry picked from commit d57c0e08c70162feab9ccab085fc34095d2dfd11) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Kyle Fortin <kyle.fortin@oracle.com>
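A minimal sketch of the idea in the commit above, assuming the packet length sits in a 14-bit field starting at bit 38 of the descriptor's status qword. The field position and all names here are assumptions for illustration, not taken from the driver headers.

```c
#include <stdint.h>

/* Assumed layout for this sketch: packet length in a 14-bit field at bit 38. */
#define SKETCH_LEN_SHIFT 38
#define SKETCH_LEN_MASK  (0x3FFFULL << SKETCH_LEN_SHIFT)

/*
 * Returns the packet length from the descriptor qword. A result of 0 means
 * the hardware has not written the descriptor back yet, so no separate
 * test of the DD bit is needed.
 */
static unsigned int sketch_rx_desc_size(uint64_t qword)
{
	return (unsigned int)((qword & SKETCH_LEN_MASK) >> SKETCH_LEN_SHIFT);
}
```

A polling loop would read the qword once, stop when the size is 0, and otherwise process the packet using the size it already has. A real driver would presumably also need a read barrier before touching the rest of the descriptor.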
There is still a security hole left in the mem.c driver: when securelevel is set, a
userland application could still access PCI configuration space via this driver.
--
Attempting to write via mmap() API using acc_test.
[root@ban92uut054 ~]# ./acc_test mmap 0x846000c4=0x1
Using mmap() API for access
mmap write wrote 0x1
if (strcmp("rw", argv[1]) == 0) {
access_type = 1;
printf("Using pread()/pwrite() API for access\n");
}
else if (strcmp("mmap", argv[1]) == 0) {
access_type = 2;
printf("Using mmap() API for access\n");
}
else {
printf("Illegal access type: must be rw or mmap\n");
return -1;
}
addr = strtoul(argv[2], &tmp, 16);
if ((tmp && (*tmp != '=')) &&
((*tmp != '\0') || (errno == EINVAL) ||
(addr == ULONG_MAX && errno == ERANGE))) {
fprintf(stderr, "Invalid address specified; must be hex based\n");
if (errno) perror("error : ");
exit(1);
}
else if (tmp && (*tmp == '=')) { // write case
tmp++;
val = strtoul(tmp, NULL, 16);
operation = 1;
}
This patch fixes the hole where, with securelevel set, one could still write to
/dev/mem via the mmap() API. The fix is to disallow opening /dev/mem or /dev/kmem
for access. Access is checked at open rather than having get_securelevel() called at
the various write/read locations.
This issue is specific to UEK4.
Signed-off-by: James Puthukattukaran <james.puthukattukaran@oracle.com> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com> Reviewed-by: Eric Snowberg <eric.snowberg@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle>
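The check-at-open approach could look roughly like the sketch below. The function names, the "greater than zero" threshold, and the -EPERM return value are assumptions for illustration. Only get_securelevel() is named in the patch description, and it is replaced here by a hypothetical stand-in.

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel's get_securelevel(). */
static int sketch_securelevel;

static int sketch_get_securelevel(void)
{
	return sketch_securelevel;
}

/*
 * Open handler sketch: refuse the open itself when securelevel is set, so
 * that read(), write() and mmap() never get a file descriptor to work on.
 */
static int sketch_open_mem(void)
{
	if (sketch_get_securelevel() > 0)
		return -EPERM;
	return 0;
}
```

Checking once at open covers every access path, including mmap(), without repeating the test in each handler.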
Igor Mammedov [Fri, 4 Dec 2015 13:07:06 +0000 (14:07 +0100)]
x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFN
When a memory-hotplug-enabled system is booted with less
than 4GB of RAM and more RAM is hotplugged later,
32-bit devices stop functioning with the following error:
The reason for this is that if an x86_64 system is booted
with less than 4GB of RAM, it doesn't enable SWIOTLB, and
when memory is hotplugged beyond MAX_DMA32_PFN, devices
that expect 32-bit addresses can't handle 64-bit addresses.
Fix it by tracking max possible PFN when parsing
memory affinity structures from SRAT ACPI table and
enable SWIOTLB if there are hotpluggable memory
regions beyond MAX_DMA32_PFN.
It fixes KVM guests when they use emulated devices
(reproduces with ata_piix, e1000 and usb devices,
RHBZ: 1275941, 1275977, 1271527)
It also fixes Hyper-V and VMware guests with emulated devices,
which are affected by this issue as well.
Signed-off-by: Igor Mammedov <imammedo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: fujita.tomonori@lab.ntt.co.jp Cc: konrad.wilk@oracle.com Cc: pbonzini@redhat.com Cc: revers@redhat.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1449234426-273049-3-git-send-email-imammedo@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit ec941c5ffede4d788b9fc008f9eeca75b9e964f5) Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26754302
Igor Mammedov [Fri, 4 Dec 2015 13:07:05 +0000 (14:07 +0100)]
x86/mm: Introduce max_possible_pfn
max_possible_pfn will be used for tracking max possible
PFN for memory that isn't present in E820 table and
could be hotplugged later.
By default max_possible_pfn is initialized with max_pfn,
but later it could be updated with highest PFN of
hotpluggable memory ranges declared in ACPI SRAT table
if any present.
Signed-off-by: Igor Mammedov <imammedo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: fujita.tomonori@lab.ntt.co.jp Cc: konrad.wilk@oracle.com Cc: pbonzini@redhat.com Cc: revers@redhat.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1449234426-273049-2-git-send-email-imammedo@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 8dd3303001976aa8583bf20f6b93590c74114308) Signed-off-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26754302
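Taken together, the two commits above amount to tracking the highest PFN that hotplug could ever reach and comparing it against the 32-bit DMA limit. A minimal sketch, assuming 4K pages and a simplified, hypothetical view of an SRAT memory affinity entry (none of these names are from the kernel):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT    12  /* assumed 4K pages */
#define SKETCH_MAX_DMA32_PFN (1ULL << (32 - SKETCH_PAGE_SHIFT))

/* Hypothetical, simplified view of one SRAT memory affinity entry. */
struct sketch_srat_mem {
	uint64_t start;        /* physical start address */
	uint64_t length;       /* size in bytes */
	bool     hotpluggable;
};

/* Start from max_pfn and raise it to cover every hotpluggable range. */
static uint64_t sketch_max_possible_pfn(uint64_t max_pfn,
					const struct sketch_srat_mem *m,
					size_t n)
{
	uint64_t max_possible = max_pfn;

	for (size_t i = 0; i < n; i++) {
		uint64_t end_pfn = (m[i].start + m[i].length) >> SKETCH_PAGE_SHIFT;

		if (m[i].hotpluggable && end_pfn > max_possible)
			max_possible = end_pfn;
	}
	return max_possible;
}

/* SWIOTLB is needed if memory could ever appear above the 4GB boundary. */
static bool sketch_need_swiotlb(uint64_t max_possible_pfn)
{
	return max_possible_pfn > SKETCH_MAX_DMA32_PFN;
}
```

For a guest booted with 2GB and a hotpluggable range ending at 8GB, the decision changes from "not needed" (based on max_pfn alone) to "needed" (based on the maximum possible PFN).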
Alan Maguire [Wed, 4 Oct 2017 21:18:40 +0000 (23:18 +0200)]
dtrace lockstat provider probes
This patch adds DTrace probes for locking events covering mutexes,
read-write locks and spinlocks and is similar to the lockstat provider
for Solaris. However it differs from the Solaris lockstat provider
in one way - on Linux, rwlocks are implemented via spinlocks so there
is no "rw-block" probe; rather a "rw-spin" probe. Additionally,
rwlocks cannot be upgraded or downgraded, so the "rw-upgrade" and
"rw-downgrade" probes are not present.
The "-acquire" probes fire when the lock is acquired.
The "-spin" probes fire on contention events when the lock needed
to spin. The probe fires just prior to acquisition of locks where
contention occurred and arg1 contains the total time spent spinning.
The "adaptive-block" probe fires on contention events where the thread
blocked waiting on lock acquisition. The probe fires just prior to
lock acquisition and arg1 contains the total sleep time incurred.
The "-error" probe fires when an error occurs when trying to
acquire an adaptive lock.
The "-release" probes fire when the lock is released.
Arguments:
arg0: the lock itself (a struct mutex *, spinlock_t *, or rwlock_t *)
arg1:
for rw-acquire/rw-release probes only, arg1 is RW_READER for
acquire/release as reader, RW_WRITER for acquire/release as a writer.
for *-spin or *-block probes, arg1 is the total time in nanoseconds
spent spinning or blocking.
arg2:
for rw-spin only, arg2 is RW_READER when spinning on a rwlock as a reader,
RW_WRITER when spinning on a rwlock as a writer.
Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kris Van Hees <kris.van.hees@oracle.com>
Enhance diagnosability when an RDS IB/CM connection gets into the
"Receiver Not Ready" state. The following data are added to the
per-RDS/IB connection info that is currently displayed through
rds-info:
- w_alloc_ctr of the receive ring (struct rds_ib_work_ring)
- w_free_ctr
- qp_num of the connection
Thomas Gleixner [Tue, 31 Jan 2017 14:24:03 +0000 (15:24 +0100)]
timerfd: Protect the might cancel mechanism proper
The handling of the might_cancel queueing is not properly protected, so
parallel operations on the file descriptor can race with each other and
lead to list corruptions or use after free.
Protect the context for these operations with a separate lock.
The wait queue lock cannot be reused for this because that would create a
lock inversion scenario vs. the cancel lock. Replacing might_cancel with an
atomic (atomic_t or atomic bit) does not help either because it still can
race vs. the actual list operation.
Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "linux-fsdevel@vger.kernel.org" Cc: syzkaller <syzkaller@googlegroups.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 1e38da300e1e395a15048b0af1e5305bd91402f6)
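The point of the timerfd fix above is that the might_cancel test and the list operation must happen as one unit under a dedicated lock. The sketch below shows that pattern in user-space C11. The structure, the spinlock built on atomic_flag, and the boolean that stands in for list membership are all assumptions for illustration, not the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical context; on_cancel_list stands in for real list membership. */
struct sketch_timerfd_ctx {
	atomic_flag cancel_lock;  /* dedicated lock, not the wait queue lock */
	bool might_cancel;
	bool on_cancel_list;
};

static void sketch_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		; /* spin until the holder releases the lock */
}

static void sketch_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

/* The flag test and the list update form one critical section. */
static void sketch_setup_cancel(struct sketch_timerfd_ctx *ctx)
{
	sketch_lock(&ctx->cancel_lock);
	if (!ctx->might_cancel) {
		ctx->might_cancel = true;
		ctx->on_cancel_list = true;   /* a list add in real code */
	}
	sketch_unlock(&ctx->cancel_lock);
}

static void sketch_remove_cancel(struct sketch_timerfd_ctx *ctx)
{
	sketch_lock(&ctx->cancel_lock);
	if (ctx->might_cancel) {
		ctx->might_cancel = false;
		ctx->on_cancel_list = false;  /* a list delete in real code */
	}
	sketch_unlock(&ctx->cancel_lock);
}
```

Making might_cancel atomic on its own would not help here: two callers could still both pass the test before either performs the list operation, which is the race the commit message describes.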
brcmfmac: fix possible buffer overflow in brcmf_cfg80211_mgmt_tx()
The lower level nl80211 code in cfg80211 ensures that "len" is between
25 and NL80211_ATTR_FRAME (2304). We subtract DOT11_MGMT_HDR_LEN (24) from
"len" so that's a max of 2280. However, the action_frame->data[] buffer is
only BRCMF_FIL_ACTION_FRAME_SIZE (1800) bytes long so this memcpy() can
overflow.
Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com> Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
cfg80211.c is in a different directory.
Conflicts:
drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
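The overflow described in the brcmfmac commit above is avoided by validating the body length against the destination buffer before copying. A minimal sketch with hypothetical names follows. Only the sizes 24 and 1800 come from the commit message, and the choice of -EINVAL is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MGMT_HDR_LEN       24    /* DOT11_MGMT_HDR_LEN per the commit text */
#define SKETCH_ACTION_FRAME_SIZE  1800  /* BRCMF_FIL_ACTION_FRAME_SIZE per the commit text */

/* Hypothetical, reduced version of the action frame structure. */
struct sketch_action_frame {
	unsigned short len;
	unsigned char  data[SKETCH_ACTION_FRAME_SIZE];
};

/* Copy the frame body after the management header, rejecting oversized input. */
static int sketch_copy_action_frame(struct sketch_action_frame *af,
				    const unsigned char *buf, size_t len)
{
	size_t body;

	if (len < SKETCH_MGMT_HDR_LEN)
		return -EINVAL;

	body = len - SKETCH_MGMT_HDR_LEN;
	if (body > sizeof(af->data))
		return -EINVAL;   /* would overflow data[] */

	memcpy(af->data, buf + SKETCH_MGMT_HDR_LEN, body);
	af->len = (unsigned short)body;
	return 0;
}
```

A 2304-byte frame, which the upper layer permits, is rejected here because its 2280-byte body exceeds the 1800-byte buffer.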
The ahash API modifies the request's callback function in order
to clean up after itself in some corner cases (unaligned final
and missing finup).
When the request is complete ahash will restore the original
callback and everything is fine. However, when the request gets
an EBUSY on a full queue, an EINPROGRESS callback is made while
the request is still ongoing.
In this case the ahash API will incorrectly call its own callback.
This patch fixes the problem by creating a temporary request
object on the stack which is used to relay EINPROGRESS back to
the original completion function.
This patch also adds code to preserve the original flags value.
Fixes: ab6bf4e5e5e4 ("crypto: hash - Fix the pointer voodoo in...") Cc: <stable@vger.kernel.org> Reported-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit ef0579b64e93188710d48667cb5e014926af9f1b) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
xen/mmu: Call xen_cleanhighmap() with 4MB aligned for page tables mapping
When booting a PVM guest with large memory (e.g. 240GB), the initial mapping
provided by Xen overlaps with the kernel module virtual space. When the mapping in
this space is cleared by xen_cleanhighmap(), in certain cases a 2MB mapping could be
left behind. This is because Xen initializes 4MB-aligned mappings but
xen_cleanhighmap() finishes at a 2MB boundary.
When a module is loaded just on top of the 2MB space, the warning below is seen:
With the addition of hugetlbfs support in memfd_create, the memfd
selftests should verify correct functionality with hugetlbfs.
Instead of writing a separate memfd hugetlbfs test, modify the
memfd_test program to take an optional argument 'hugetlbfs'. If the
hugetlbfs argument is specified, basic memfd_create functionality will
be exercised on hugetlbfs. If hugetlbfs is not specified, the current
functionality of the test is unchanged.
Note that many of the tests in memfd_test test file sealing operations.
hugetlbfs does not support file sealing, therefore for hugetlbfs all
sealing related tests are skipped.
In order to test on hugetlbfs, there needs to be preallocated huge
pages. A new script (run_tests) is added. This script will first run
the existing memfd_create tests. It will then attempt to allocate the
required number of huge pages before running the hugetlbfs test. At the
end of testing, it will release any huge pages allocated for testing
purposes.
Link: http://lkml.kernel.org/r/1502495772-24736-3-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1f522a4856600ac579765b729178f2b3b6a69129) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
tools/testing/selftests/memfd/Makefile
This patch came out of discussions in this e-mail thread:
http://lkml.kernel.org/r/1499357846-7481-1-git-send-email-mike.kravetz%40oracle.com
The Oracle JVM team is developing a new garbage collection model. This
new model requires multiple mappings of the same anonymous memory. One
straightforward way to accomplish this is with memfd_create. They can
use the returned fd to create multiple mappings of the same memory.
The JVM today has an option to use (static hugetlb) huge pages. If this
option is specified, they would like to use the same garbage collection
model requiring multiple mappings to the same memory. Using hugetlbfs,
it is possible to explicitly mount a filesystem and specify file paths
in order to get an fd that can be used for multiple mappings. However,
this introduces additional system admin work and coordination.
Ideally they would like to get a hugetlbfs fd without requiring explicit
mounting of a filesystem. Today, mmap and shmget can make use of
hugetlbfs without explicitly mounting a filesystem. The patch adds this
functionality to memfd_create.
Add a new flag MFD_HUGETLB to memfd_create() that will specify the file
to be created resides in the hugetlbfs filesystem. This is the generic
hugetlbfs filesystem not associated with any specific mount point. As
with other system calls that request hugetlbfs backed pages, there is
the ability to encode huge page size in the flag arguments.
hugetlbfs does not support sealing operations, therefore specifying
MFD_ALLOW_SEALING with MFD_HUGETLB will result in EINVAL.
Of course, the memfd_create man page would need updating if this type of
functionality moves forward.
Link: http://lkml.kernel.org/r/1502149672-7759-2-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 749df87bd7bee5a79cef073f5d032ddb2b211de8) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
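The flag rule stated in the MFD_HUGETLB commit above can be sketched as a small validation function. The function name is hypothetical, and the constants are local copies whose values are assumed to match the uapi header. This is not the kernel's implementation.

```c
#include <errno.h>

/* Local copies for the sketch; values assumed to match the uapi header. */
#define SKETCH_MFD_CLOEXEC        0x0001u
#define SKETCH_MFD_ALLOW_SEALING  0x0002u
#define SKETCH_MFD_HUGETLB        0x0004u

/* hugetlbfs does not support sealing, so the two flags are mutually exclusive. */
static int sketch_check_memfd_flags(unsigned int flags)
{
	if ((flags & SKETCH_MFD_HUGETLB) && (flags & SKETCH_MFD_ALLOW_SEALING))
		return -EINVAL;
	return 0;
}
```

A caller wanting a hugetlbfs-backed fd would pass the hugetlb flag (optionally with close-on-exec) and must leave sealing out.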
A non-default huge page size can be encoded in the flags argument of the
mmap system call. The definitions for these encodings are in arch
specific header files. However, all architectures use the same values.
Consolidate all the definitions in the primary user header file
(uapi/linux/mman.h). Include definitions for all known huge page sizes.
Use the generic encoding definitions in hugetlb_encode.h as the basis
for these definitions.
Link: http://lkml.kernel.org/r/1501527386-10736-3-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit aafd4562dfee81a40ba21b5ea3cf5e06664bc7f6) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
arch/alpha/include/uapi/asm/mman.h
arch/mips/include/uapi/asm/mman.h
arch/parisc/include/uapi/asm/mman.h
arch/x86/include/uapi/asm/mman.h
arch/xtensa/include/uapi/asm/mman.h
include/uapi/asm-generic/mman-common.h
Patch series "Consolidate system call hugetlb page size encodings".
These patches are the result of discussions in
https://lkml.org/lkml/2017/3/8/548. The following changes are made in the
patch set:
1) Put all the log2 encoded huge page size definitions in a common
header file. The idea is to have a set of definitions that can be used as
the basis for system call specific definitions such as MAP_HUGE_* and
SHM_HUGE_*.
2) Remove MAP_HUGE_* definitions in arch specific files. All these
definitions are the same. Consolidate all definitions in the primary
user header file (uapi/linux/mman.h).
3) Remove SHM_HUGE_* definitions intended for user space from kernel
header file, and add to user (uapi/linux/shm.h) header file. Add
definitions for all known huge page size encodings as in mmap.
This patch (of 3):
If hugetlb pages are requested in mmap or shmget system calls, a huge
page size other than default can be requested. This is accomplished by
encoding the log2 of the huge page size in the upper bits of the flag
argument. asm-generic and arch specific headers all define the same
values for these encodings.
Put common definitions in a single header file. The primary uapi header
files for mmap and shm will use these definitions as a basis for
definitions specific to those system calls.
Link: http://lkml.kernel.org/r/1501527386-10736-2-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e652f694598273c5d749687032d1534a30e6a3a5) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
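The encoding described in the series above places log2 of the huge page size in the upper bits of the flags argument. A minimal sketch, assuming a 6-bit field starting at bit 26. The shift, the mask, and the function names are assumptions for illustration and should be checked against hugetlb_encode.h.

```c
/* Assumed to mirror hugetlb_encode.h: log2(size) in a 6-bit field at bit 26. */
#define SKETCH_HUGETLB_ENCODE_SHIFT 26
#define SKETCH_HUGETLB_ENCODE_MASK  0x3fu

/* Build the flag bits for a huge page of size 2^log2_size bytes. */
static unsigned int sketch_encode_huge(unsigned int log2_size)
{
	return (log2_size & SKETCH_HUGETLB_ENCODE_MASK) << SKETCH_HUGETLB_ENCODE_SHIFT;
}

/* Recover log2 of the requested huge page size from a flags word. */
static unsigned int sketch_decode_huge(unsigned int flags)
{
	return (flags >> SKETCH_HUGETLB_ENCODE_SHIFT) & SKETCH_HUGETLB_ENCODE_MASK;
}
```

Under these assumptions a 2MB page is encoded as 21 << 26 and a 1GB page as 30 << 26, and the same values can serve as the basis for both the MAP_HUGE_* and SHM_HUGE_* definitions.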
RDS: IB: Change the proxy qp's path_mtu to IB_MTU_256
The path_mtu of proxy qp of RDS is currently set to IB_MTU_4096, but it
doesn't have much relevance, since the proxy qp is used only for
registration and invalidation of MRs. For the proxy qp to work in most
environments, this patch changes the path_mtu to IB_MTU_256.
This gets rid of the horrible notion of having that
struct inode *ptmx_inode
be the linchpin of the interface between the pty code and devpts.
By de-emphasizing the ptmx inode, a lot of things actually get cleaner,
and we will have a much saner way forward. In particular, this will
allow us to associate with any particular devpts instance at open-time,
and not be artificially tied to one particular ptmx inode.
The patch itself is actually fairly straightforward, and apart from some
locking and return path cleanups it's pretty mechanical:
- the interfaces that devpts exposes all take "struct pts_fs_info *"
instead of "struct inode *ptmx_inode" now.
NOTE! The "struct pts_fs_info" thing is a completely opaque structure
as far as the pty driver is concerned: it's still declared entirely
internally to devpts. So the pty code can't actually access it in any
way, just pass it as a "cookie" to the devpts code.
- the "look up the pts fs info" is now a single clear operation, that
also does the reference count increment on the pts superblock.
So "devpts_add/del_ref()" is gone, and replaced by a "lookup and get
ref" operation (devpts_get_ref(inode)), along with a "put ref" op
(devpts_put_ref()).
- the pty master "tty->driver_data" field now contains the pts_fs_info,
not the ptmx inode.
- because we don't care about the ptmx inode any more as some kind of
base index, the ref counting can now drop the inode games - it just
gets the ref on the superblock.
- the pts_fs_info now has a back-pointer to the super_block. That's so
that we can easily look up the information we actually need. Although
quite often, the pts fs info was actually all we wanted, and not having
to look it up based on some magical inode makes things more
straightforward.
In particular, now that "devpts_get_ref(inode)" operation should really
be the *only* place we need to look up what devpts instance we're
associated with, and we do it exactly once, at ptmx_open() time.
The other side of this is that one ptmx node could now be associated
with multiple different devpts instances - you could have a single
/dev/ptmx node, and then have multiple mount namespaces with their own
instances of devpts mounted on /dev/pts/. And that's all perfectly sane
in a model where we just look up the pts instance at open time.
This will eventually allow us to get rid of our odd single-vs-multiple
pts instance model, but this patch in itself changes no semantics, only
an internal binding model.
Cc: Eric Biederman <ebiederm@xmission.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Aurelien Jarno <aurelien@aurel32.net> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jann@thejh.net> Cc: Greg KH <greg@kroah.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Florian Weimer <fw@deneb.enyo.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 67245ff332064c01b760afa7a384ccda024bfd24)
Signed-off-by: Maran Wilson <maran.wilson@oracle.com> Reviewed-by: Wim ten Have <wim.ten.have@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
drivers/tty/pty.c
fs/devpts/inode.c
There are two patches present in mainline that came before this one which are
still missing from UEK. They are:
1) pty: Remove pty_unix98_shutdown()
responsible for the conflict in drivers/tty/pty.c
2) devpts: if initialization failed, don't crash when opening /dev/ptmx
responsible for the conflict in fs/devpts/inode.c
Neither seemed like they were critical enough nor directly tied to the patch
I wanted, to justify pulling them along for the ride. So instead, I manually
resolved the conflicting chunks of code, applying only the deltas that were
related to "devpts: clean up interface to pty drivers" in a way that makes
sense for that particular patch.
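The "opaque cookie plus get/put reference" interface described in the devpts commit above can be sketched as below. Everything here is a simplified assumption: the structures are stand-ins, the lookup takes a superblock directly rather than an inode, and the reference count is a plain integer.

```c
#include <stddef.h>

struct sketch_pts_fs_info;  /* the pty side would only ever see this incomplete type */

/* devpts-private definitions (hypothetical, heavily reduced). */
struct sketch_super_block {
	int s_active;                        /* reference count on the superblock */
	struct sketch_pts_fs_info *fs_info;
};

struct sketch_pts_fs_info {
	struct sketch_super_block *sb;       /* back-pointer to the superblock */
};

/* Single "lookup and get ref" operation, done once at open time. */
static struct sketch_pts_fs_info *
sketch_devpts_get_ref(struct sketch_super_block *sb)
{
	if (sb == NULL || sb->fs_info == NULL)
		return NULL;
	sb->s_active++;
	return sb->fs_info;
}

/* Matching "put ref" operation; uses the back-pointer, no inode involved. */
static void sketch_devpts_put_ref(struct sketch_pts_fs_info *fsi)
{
	fsi->sb->s_active--;
}
```

The pty code would store the returned pointer (for example in driver data) and hand it back to devpts unchanged, without dereferencing it.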
Neal Cardwell [Mon, 25 Jan 2016 22:01:53 +0000 (14:01 -0800)]
tcp: fix tcp_mark_head_lost to check skb len before fragmenting
This commit fixes a corner case in tcp_mark_head_lost() which was
causing the WARN_ON(len > skb->len) in tcp_fragment() to fire.
tcp_mark_head_lost() was assuming that if a packet has
tcp_skb_pcount(skb) of N, then it's safe to fragment off a prefix of
M*mss bytes, for any M < N. But with the tricky way TCP pcounts are
maintained, this is not always true.
For example, suppose the sender sends four 1-byte packets and has the
last 3 packets SACKed. It will merge the last 3 packets in the write
queue into an skb with pcount = 3 and len = 3 bytes. If another
recovery happens after a sack reneging event, tcp_mark_head_lost()
may attempt to split the skb assuming it has more than 2*MSS bytes.
This sounds very counterintuitive, but as the commit description for
the related commit c0638c247f55 ("tcp: don't fragment SACKed skbs in
tcp_mark_head_lost()") notes, this is because tcp_shifted_skb()
coalesces adjacent regions of SACKed skbs, and when doing this it
preserves the sum of their packet counts in order to reflect the
real-world dynamics on the wire. The c0638c247f55 commit tried to
avoid problems by not fragmenting SACKed skbs, since SACKed skbs are
where the non-proportionality between pcount and skb->len/mss is known
to be possible. However, that commit did not handle the case where
during a reneging event one of these weird SACKed skbs becomes an
un-SACKed skb, which tcp_mark_head_lost() can then try to fragment.
The fix is to simply mark the entire skb lost when this happens.
This makes the recovery slightly more aggressive in such corner
cases before we detect reordering. But once we detect reordering
this code path is by-passed because FACK is disabled.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit d88270eef4b56bd7973841dd1fed387ccfa83709)
Orabug: 26646104
Conflicts:
tcp_skb_mss is not used in UEK4. Hence, skb_shinfo()
is used to get the mss size.
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
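The length check at the heart of the tcp_mark_head_lost() fix above can be sketched as a pure function. The name and the "0 means mark the whole skb lost" return convention are assumptions for this example.

```c
/*
 * Decide how many bytes to split off the head of an skb when marking
 * pkts_to_mark packets lost. Returns 0 when no split is safe, meaning the
 * caller should mark the entire skb lost instead of fragmenting it.
 */
static unsigned int sketch_lost_split_len(unsigned int skb_len,
					  unsigned int mss,
					  unsigned int pkts_to_mark)
{
	unsigned int lost = pkts_to_mark * mss;

	if (lost < skb_len)
		return lost;   /* prefix is strictly shorter than the skb */
	return 0;              /* pcount and len are not proportional here */
}
```

For the case in the commit message (an skb with pcount = 3 but len = 3 bytes), asking for two packets' worth of data at a 1460-byte MSS yields 2920, which is not less than 3, so the skb is marked lost as a whole.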
Jim Mattson [Tue, 12 Sep 2017 20:02:54 +0000 (13:02 -0700)]
kvm: nVMX: Don't allow L2 to access the hardware CR8
If L1 does not specify the "use TPR shadow" VM-execution control in
vmcs12, then L0 must specify the "CR8-load exiting" and "CR8-store
exiting" VM-execution controls in vmcs02. Failure to do so will give
the L2 VM unrestricted read/write access to the hardware CR8.
This fixes CVE-2017-12154.
Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 51aa68e7d57e3217192d88ce90fd5b8ef29ec94f)
OraBug: 26868769 CVE-2017-12154 kvm: nVMX: L2 guest could access hardware(L0) CR8 register Tested-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
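The rule in the CR8 fix above reduces to one conditional on the execution controls. In this sketch the bit positions (19, 20 and 21) and all names are assumptions for illustration and should be checked against the architecture manual.

```c
#include <stdint.h>

/* Bit positions assumed for this sketch (primary processor-based controls). */
#define SKETCH_CR8_LOAD_EXITING   (1u << 19)
#define SKETCH_CR8_STORE_EXITING  (1u << 20)
#define SKETCH_TPR_SHADOW         (1u << 21)

/*
 * Given the execution controls derived from vmcs12, return the controls to
 * use for vmcs02. Without "use TPR shadow", both CR8 exits are forced on so
 * that L2 never touches the hardware CR8 directly.
 */
static uint32_t sketch_vmcs02_exec_control(uint32_t exec_control)
{
	if (!(exec_control & SKETCH_TPR_SHADOW))
		exec_control |= SKETCH_CR8_LOAD_EXITING |
				SKETCH_CR8_STORE_EXITING;
	return exec_control;
}
```

When L1 does request the TPR shadow, the controls pass through unchanged in this sketch.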
The SDT stub function is used during the kernel boot process (prior to
the patching of SDT probe points). Since it is used for both regular
SDT probes and is-enabled SDT probes, it should return 0 to be a no-op
before call patching takes place.
Orabug: 26909775 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
Wei Wang [Thu, 18 May 2017 18:22:33 +0000 (11:22 -0700)]
tcp: initialize rcv_mss to TCP_MIN_MSS instead of 0
When tcp_disconnect() is called, inet_csk_delack_init() sets
icsk->icsk_ack.rcv_mss to 0.
This could potentially cause the tcp_recvmsg() => tcp_cleanup_rbuf() =>
__tcp_select_window() call path to divide by 0.
So this patch initializes rcv_mss to TCP_MIN_MSS instead of 0.
Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 499350a5a6e7512d9ed369ed63a4244b6536f4f8)
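Why a zero rcv_mss is dangerous can be seen in a reduced form of the window rounding. The structure and function names are hypothetical, the value 88 for TCP_MIN_MSS is an assumption, and the real __tcp_select_window() does considerably more than this.

```c
#define SKETCH_TCP_MIN_MSS 88u  /* assumed value of TCP_MIN_MSS */

/* Hypothetical, reduced version of the delayed-ACK state. */
struct sketch_ack_state {
	unsigned int rcv_mss;
};

/* After the fix: start from the minimum MSS rather than 0. */
static void sketch_delack_init(struct sketch_ack_state *a)
{
	a->rcv_mss = SKETCH_TCP_MIN_MSS;
}

/* Round free space down to a multiple of rcv_mss; this divides by rcv_mss. */
static unsigned int sketch_select_window(const struct sketch_ack_state *a,
					 unsigned int free_space)
{
	return (free_space / a->rcv_mss) * a->rcv_mss;
}
```

With rcv_mss initialized to a nonzero minimum, the division is always defined, even right after a disconnect.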
Sabrina Dubroca [Wed, 3 May 2017 14:43:19 +0000 (16:43 +0200)]
xfrm: fix stack access out of bounds with CONFIG_XFRM_SUB_POLICY
When CONFIG_XFRM_SUB_POLICY=y, xfrm_dst stores a copy of the flowi for
that dst. Unfortunately, the code that allocates and fills this copy
doesn't care about what type of flowi (flowi, flowi4, flowi6) gets
passed. In multiple code paths (from raw_sendmsg, from TCP when
replying to a FIN, in vxlan, geneve, and gre), the flowi that gets
passed to xfrm is actually an on-stack flowi4, so we end up reading
stuff from the stack past the end of the flowi4 struct.
Since xfrm_dst->origin isn't used anywhere following commit ca116922afa8 ("xfrm: Eliminate "fl" and "pol" args to
xfrm_bundle_ok()."), just get rid of it. xfrm_dst->partner isn't used
either, so get rid of that too.
Fixes: 9d6ec938019c ("ipv4: Use flowi4 in public route lookup interfaces.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
(cherry picked from commit 9b3eb54106cf6acd03f07cf0ab01c13676a226c2)
David Howells [Wed, 14 Jun 2017 23:12:24 +0000 (00:12 +0100)]
rxrpc: Fix several cases where a padded len isn't checked in ticket decode
This fixes CVE-2017-7482.
When a kerberos 5 ticket is being decoded so that it can be loaded into an
rxrpc-type key, there are several places in which the length of a
variable-length field is checked to make sure that it's not going to
overrun the available data - but the data is padded to the nearest
four-byte boundary and the code doesn't check for this extra. This could
lead to the size-remaining variable wrapping and the data pointer going
over the end of the buffer.
Fix this by making the various variable-length data checks use the padded
length.
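
A minimal sketch of the difference between checking the raw length and checking the padded length, assuming four-byte alignment as described above. The helper names are hypothetical and do not correspond to functions in the rxrpc code.

```python
def padded_len(length: int) -> int:
    """Round length up to the next multiple of four bytes."""
    return (length + 3) & ~3


def fits_unpadded(length: int, remaining: int) -> bool:
    """Old-style check: only the raw field length is compared."""
    return length <= remaining


def fits_padded(length: int, remaining: int) -> bool:
    """Fixed check: the padded length must fit in the remaining data."""
    return padded_len(length) <= remaining
```

For a 5-byte field with 6 bytes remaining, the unpadded check passes even though consuming the padded 8 bytes would run past the end of the data; the padded check rejects it.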
Reported-by: 石磊 <shilei-c@360.cn> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.c.dionne@auristor.com> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(backported from commit 5f2f97656ada8d811d3c1bef503ced266fcd53a0)
Juergen Gross [Tue, 30 May 2017 18:52:26 +0000 (20:52 +0200)]
xen: don't print error message in case of missing Xenstore entry
When registering for the Xenstore watch of the node control/sysrq the
handler will be called at once. Don't issue an error message if the
Xenstore node isn't there, as it will be created only when an event
is being triggered.
(cherry picked from commit 4e93b6481c87ea5afde944a32b4908357ec58992) Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
mlx4_core: calculate log_num_mtt based on total system memory
The SR-IOV shared-port mechanism has a limitation that all the resources
and qp contexts are proxied through the PF. In order to reflect the
supported mtt entries, the log_num_mtt must be calculated based on the host
system memory rather than the privileged domain system memory. Thus, this
patch performs a Xen specific call to obtain the total memory during the PF
driver loading and uses that info to determine the size of the mtt table.
Boris Ostrovsky [Fri, 15 Sep 2017 20:23:53 +0000 (16:23 -0400)]
xen/x86: Add interface for querying amount of host memory
A driver (or some other entity in the kernel) may need to know the
amount of memory available on the host. Provide the interface (for
a privileged domain) to obtain this information.
rds: Fix non-atomic operation on shared flag variable
The bits in m_flags in struct rds_message are used for a plurality of
reasons, and from different contexts. To avoid any missing updates to
m_flags, use the atomic set_bit() instead of the non-atomic equivalent.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Knut Omang <knut.omang@oracle.com> Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from upstream f530f39f5ff97209cc6f1bf66e634685954ad741)
In rds_send_xmit() there is logic to batch the sends. However, if
another thread has acquired the lock and has incremented the send_gen,
it is considered a race and we yield. The code incrementing the
s_send_lock_queue_raced statistics counter did not count this event
correctly.
This commit counts the race condition correctly.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Knut Omang <knut.omang@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from upstream 126f760ca94dae77425695f9f9238b731de86e32)
Jacob Keller [Wed, 12 Jul 2017 09:46:05 +0000 (05:46 -0400)]
i40e: use cpumask_copy instead of direct assignment
According to the header file cpumask.h, we shouldn't be directly copying
a cpumask_t, since it's a bitmap and might not be copied correctly. Let's
use the provided cpumask_copy() function instead.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26822609
(cherry picked from commit 7e4d01e7d3f7d4f7b0a768a1028cb26ea06c8694) Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Dib Chatterjee <dib.chatterjee@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure. The problem is that
THP allocation requests potentially enter reclaim/compaction. This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses. While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so. Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM. It's been years and
it's time to throw in the towel.
First, this patch defines THP defrag as follows:
madvise: A failed allocation will direct reclaim/compact if the application requests it
never: Neither reclaim/compact nor wake kswapd
defer: A failed allocation will wake kswapd/kcompactd
always: A failed allocation will direct reclaim/compact (historical behaviour)
khugepaged defrag will enter direct/reclaim but not wake kswapd.
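
The four modes listed above can be summarised as a small decision function. This is an illustrative sketch only: the function name and the action labels are hypothetical, and returning no action for "madvise" without a request from the application is my reading of the description, not something taken from the patch.

```python
DIRECT = "direct reclaim/compact"
WAKE = "wake kswapd/kcompactd"


def thp_failure_actions(mode: str, app_requested: bool = False) -> set:
    """Return the actions taken after a failed THP allocation.

    mode is one of "always", "defer", "madvise", "never";
    app_requested models an application that asked for THP via madvise.
    """
    if mode == "always":
        return {DIRECT}          # historical behaviour
    if mode == "defer":
        return {WAKE}            # background work only
    if mode == "madvise":
        return {DIRECT} if app_requested else set()
    if mode == "never":
        return set()             # neither reclaim/compact nor wakeups
    raise ValueError("unknown THP defrag mode: " + mode)
```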
Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.
Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work. The callers that
really care are slub/slab and they are updated accordingly. The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.
This means that a THP fault will no longer stall for most applications
by default, which is ideal for most users, who get THP if they are
immediately available. There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or picking a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary. THP defrag for khugepaged remains enabled and will enter
direct/reclaim but not wake up kswapd or kcompactd.
After this patch a THP allocation failure will quickly fallback and rely
on khugepaged to recover the situation at some time in the future. In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win whereas a stall to reclaim/compaction
is definitely measurable and can be painful.
The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times. The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete. It uses multiple threads to see
if that is a factor. On UMA, the performance is almost identical so is
not reported but on NUMA, we see this
For a single thread, the benchmark completes 43.23% faster with this
patch applied with smaller benefits as the thread count increases. Similarly,
notice the large reduction in most cases in system CPU usage. The
overall CPU time is
                      4.4.0          4.4.0
             kcompactd-v1r1  nodefrag-v1r3
User               10357.65       10438.33
System              3988.88        3543.94
Elapsed             2203.01        1634.41
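
The overall totals above can be turned into relative changes with a little arithmetic. The sketch below uses only the figures from the table; `percent_reduction` is a hypothetical helper name.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


# Totals taken from the table (kcompactd-v1r1 vs nodefrag-v1r3).
system_reduction = percent_reduction(3988.88, 3543.94)
elapsed_reduction = percent_reduction(2203.01, 1634.41)

print("System CPU reduced by %.2f%%" % system_reduction)
print("Elapsed time reduced by %.2f%%" % elapsed_reduction)
```

By this calculation the total elapsed time drops by roughly a quarter and system CPU time by roughly a ninth, while user time is nearly unchanged.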
This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.
I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used
The average time to fault pages is substantially reduced in the majority
of cases but with the obvious caveat that fewer THPs are actually used
in this adverse workload
Note again that while this does swap as it's an aggressive workload, the
direct reclaim activity and allocation stalls are substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.
I also tried the stutter benchmark. For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available
This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently. This shows a mix because the ideal case of mapping with THP
is not hit as often. However, note that 99% of the mappings complete
13.79% faster. The CPU usage here is particularly interesting
                      4.4.0          4.4.0
             kcompactd-v1r1  nodefrag-v1r3
User                  67.50           0.99
System              1327.88          91.30
Elapsed             2079.00        2128.98
Allocation stalls and all direct reclaim activity is eliminated as well
as compaction-related stalls.
THP gives impressive gains in some cases but only if they are quickly
available. We're not going to reach the point where they are completely
free so let's take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
This problem occurs because the aes_gcm_enc/dec test templates have an actual
IV size of 13 bytes, but the algorithm copies 16 bytes, which leads to an
out-of-bounds access. The fix is to initialize the iv member to MAX_IV_SIZE.
Fixes: b824b1aa827f ("crypto: testmgr - fix out of bound read in __test_aead()") Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com>
Nick Alcock [Wed, 6 Sep 2017 10:45:51 +0000 (11:45 +0100)]
SPEC: generate CTF when DTrace is enabled.
CTF is not yet generated for debug kernels, but this is purely because
the ctf target is unavailable because CONFIG_CTF is disabled in
debug kernels, despite with_dtrace being set. If and when CONFIG_DTRACE
(and thus CONFIG_CTF) are enabled in debug kernels, we can turn on CTF
building there without incident.
(Note: non-RPM builds are now much faster than before, since they don't
generate CTF unless you ask it to, but we cannot really avoid generating
CTF for RPM builds, since DTrace needs it. Future commits will speed up
CTF generation significantly, but for now we have to take the hit, just
as we have been before now.)
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Victor Erminpour <victor.erminpour@oracle.com>
Orabug: 25815362
Nick Alcock [Tue, 5 Sep 2017 21:39:34 +0000 (22:39 +0100)]
SPEC: bump libdtrace-ctf requirement to 0.7+.
This version includes the CTF archive support needed to build
CTF into an archive rather than linking it into modules.
It is backwardly binary-, source-, and CTF-format-compatible with
current releases (0.5, 0.6).
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Victor Erminpour <victor.erminpour@oracle.com>
Orabug: 25815362
Documentation: add watermark_scale_factor to the list of vm sysctl files
Commit 795ae7a0de6b ("mm: scale kswapd watermarks in proportion to
memory") properly added the description of the new knob to
Documentation/sysctl/vm.txt, but forgot to add it to the list of files
in /proc/sys/vm. Let's fix that.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
(cherry picked from commit e6507a00fd08986ce003012a10af78cc7e47eee8)
Signed-off-by: Robert M. Harris <robert.m.harris@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Larry Bassel <larry.bassel@oracle.com> Reviewed-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Reviewed-by: Todd Vierling <todd.vierling@oracle.com>
Johannes Weiner [Thu, 17 Mar 2016 21:19:14 +0000 (14:19 -0700)]
mm: scale kswapd watermarks in proportion to memory
In machines with 140G of memory and enterprise flash storage, we have
seen read and write bursts routinely exceed the kswapd watermarks and
cause thundering herds in direct reclaim. Unfortunately, the only way
to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
system's emergency reserves - which is entirely unrelated to the
system's latency requirements. In order to get kswapd to maintain a
250M buffer of free memory, the emergency reserves need to be set to 1G.
That is a lot of memory wasted for no good reason.
On the other hand, it's reasonable to assume that allocation bursts and
overall allocation concurrency scale with memory capacity, so it makes
sense to make kswapd aggressiveness a function of that as well.
Change the kswapd watermark scale factor from the currently fixed 25% of
the tunable emergency reserve to a tunable 0.1% of memory.
Beyond 1G of memory, this will produce bigger watermark steps than the
current formula in default settings. Ensure that the new formula never
chooses steps smaller than that, i.e. 25% of the emergency reserve.
On a 140G machine, this raises the default watermark steps - the
distance between min and low, and low and high - from 16M to 143M.
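
The figures above can be reproduced with a simplified, whole-system version of the rule. This is a sketch under stated assumptions: the kernel applies the calculation per zone and in pages, the default `watermark_scale_factor` of 10 (in units of 1/10000) is taken from the sysctl documentation, and the 64 MiB emergency reserve is inferred from the 16M step quoted above. `watermark_step_mib` is a hypothetical helper.

```python
def watermark_step_mib(memory_mib: int, min_reserve_mib: int,
                       watermark_scale_factor: int = 10) -> int:
    """Distance between adjacent watermarks in MiB (simplified model)."""
    # Previous fixed rule: 25% of the emergency reserve.
    old_step = min_reserve_mib // 4
    # New rule: watermark_scale_factor/10000 of memory (10 -> 0.1%).
    scaled_step = memory_mib * watermark_scale_factor // 10000
    # The new formula never chooses a step smaller than the old one.
    return max(old_step, scaled_step)


step_140g = watermark_step_mib(140 * 1024, 64)
print("140G machine: %dM watermark step" % step_140g)
```

On the assumed 140G machine this gives a 143M step instead of the old 16M one, and on small machines the old 25% rule remains the floor.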
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 795ae7a0de6b834a0cc202aa55c190ef81496665)
Signed-off-by: Robert M. Harris <robert.m.harris@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Larry Bassel <larry.bassel@oracle.com> Reviewed-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Reviewed-by: Todd Vierling <todd.vierling@oracle.com>
Nick Alcock [Wed, 6 Sep 2017 10:45:51 +0000 (11:45 +0100)]
ctf: automate away the deduplication blacklist
The deduplication blacklist in scripts/dwarf2ctf/dedup.blacklist is a
great big kludge. It contains a list of modules that cannot be
deduplicated because they contain structures which are defined in the
same location in different ways in different kernel modules (usually
because the structure is modified by preprocessor conditionals). But
augmenting the blacklist is a pig, involving lots of poring over
debugging output to find the structure to focus on.
So automate the problem away, by augmenting type IDs for structures with
the sizeof() the structure in a new component (separated from the others
by //, a component invalid in POSIX pathnames, as usual). Helpfully
this is made available to us in the DW_AT_byte_size attribute, so it's
fast to obtain. (The component is optional because opaque structure
declarations obviously cannot include it.)
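
As a rough illustration of the idea, the sketch below builds a structure ID from components joined by "//" and appends the byte size when it is known. The component layout (declaring file, then structure name) and the function name are hypothetical; only the "//" separator and the optional size component come from the description above.

```python
def struct_type_id(decl_file: str, name: str, byte_size=None) -> str:
    """Build an illustrative type ID for a structure.

    byte_size models the DW_AT_byte_size attribute; it is omitted for
    opaque declarations, which have no size.
    """
    parts = [decl_file, "struct " + name]
    if byte_size is not None:
        parts.append(str(byte_size))
    return "//".join(parts)


# Same name and location, but different sizes because of preprocessor
# conditionals: the IDs no longer collide.
id_a = struct_type_id("include/foo.h", "foo", 24)
id_b = struct_type_id("include/foo.h", "foo", 32)
opaque = struct_type_id("include/foo.h", "foo")
```

Two variants with the same size would still share an ID, which is the remaining limitation the commit message goes on to describe.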
We adjust the one place that transforms transparent structure IDs into
opaque ones to take this tag into account.
This will still break for structures that are modified by preprocessor
conditionals in such a way that one member is replaced by another with a
different type but which has the same size as the one it replaces
(perhaps one pointer to a structure being replaced by a pointer to a
different structure), but in the interests of dwarf2ctf performance I'm
avoiding solving this for now, since we are not hitting it, and solving
it would require annotating structure IDs with some sort of hash of
their member names: the overhead of recursing over all members every
time we get an ID for a structure seems likely to be quite high, given
how often we look up type_id()s.
This change has no detectable effect on dwarf2ctf runtime, and shrinks
the CTF output by about 40KiB.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Orabug: 26765112
Nick Alcock [Tue, 5 Sep 2017 21:25:34 +0000 (22:25 +0100)]
ctf: drop CONFIG_DT_DISABLE_CTF, ctf.ko, and all that it implies
Now that CTF is decoupled from the kernel build and built into a
separate archive, there is no longer any need to drag around a
fake ctf.ko module to contain the shared and built-in CTF info.
Drop it, and kernel/ctf/, and the code to autoload it when
dtrace.ko is loaded, and move its Kconfig contents into
lib/Kconfig (which used to include kernel/ctf/Kconfig).
Furthermore, now that CTF is built on demand and not unconditionally
built every time the kernel is, there is no longer any need for
the speedup hack CONFIG_DT_DISABLE_CTF. Drop it.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Victor Erminpour <victor.erminpour@oracle.com>
Orabug: 25815362
Nick Alcock [Wed, 19 Jul 2017 14:44:05 +0000 (15:44 +0100)]
ctf: do not allow dwarf2ctf to run as root
This is just insanely dangerous: with the addition of the CTF_DEBUGDIR
info it reads almost arbitrary DWARF. elfutils is not root-rated and
frankly neither is dwarf2ctf, valgrind or no valgrind. It's just too
complicated to risk that way.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Orabug: 25815362
Nick Alcock [Wed, 19 Jul 2017 14:34:14 +0000 (15:34 +0100)]
ctf: decouple CTF building from the kernel build
This change causes CTF types for the core kernel and modules to be
generated only when the new 'ctf' make target is invoked. The CTF content
is emitted into a CTF archive with the default name of vmlinux.ctfa (the
name read by DTrace userspace): this can be changed via the CTF_FILENAME
makefile variable. If 'make ctf' has been run, 'make modules_install'
will install the generated CTF archive into the appropriate place. (If
CTF_FILENAME was specified on the 'make ctf' line, it needs to be passed
to 'make modules_install' as well for this to work.)
The existing link-into-modules machinery is still used for out-of-tree
modules, since these obviously cannot be visible when the vmlinux.ctfa
is built.
Usually the ctf target is invoked by kernel-uek.spec, but it can also
be invoked by developers if they know they have changed type or global
variable info while developing and would like DTrace to be able to
introspect the new data, or if they are building a kernel for the
first time and would like DTrace to be able to see its types at all.
(The archive format is fairly robust: you can often just copy
vmlinux.ctfa from one kernel to another, and types that have not
changed will continue to work with the new kernel.)
This depends on new machinery in libdtrace-ctf 0.7 or higher.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Reviewed-by: Victor Erminpour <victor.erminpour@oracle.com>
Orabug: 25815362
Martin K. Petersen [Fri, 11 Aug 2017 04:19:43 +0000 (00:19 -0400)]
oracleasm: Copy the integrity descriptor
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Divya Indi <divya.indi@oracle.com>
The original code made assumptions about the oracleasm integrity
descriptor hanging off of check_asm_ioc being mapped. Make sure we
properly copy and validate the descriptor before use.
RDS: IB: Add proxy qp to support FRWR through RDS_GET_MR
MR registration requested through RDS_GET_MR socket option will not have
any connection details. So, there isn't an appropriate qp to post the
registration/invalidation requests. This patch solves that issue by
using a proxy qp.
Avinash Repaka [Thu, 17 Aug 2017 21:02:47 +0000 (14:02 -0700)]
RDS: Add support for fast registration work request
This patch adds support for MR registration through work request in RDS,
commonly referred to as FRWR/fastreg/FRMR.
With this patch added, RDS chooses the registration method, between FMR
and FRWR, based on the preference given through 'prefer_frwr' module
parameter and the support offered by the underlying device.
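
The selection described above might look roughly like the following. This is a sketch only: the fallback order when the preferred method is unsupported is an assumption, and `choose_mr_method` and its arguments are hypothetical names rather than identifiers from the RDS code.

```python
def choose_mr_method(prefer_frwr: bool, fmr_supported: bool,
                     frwr_supported: bool) -> str:
    """Pick a registration method from the preference and device support."""
    if prefer_frwr and frwr_supported:
        return "FRWR"
    if fmr_supported:
        return "FMR"
    if frwr_supported:
        # Assumed fallback: FMR is unavailable, so use FRWR anyway.
        return "FRWR"
    raise ValueError("device supports neither FMR nor FRWR")
```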
Please note that this patch is adding support for MR registration done
only through CMSG. Support for registrations through RDS_GET_MR socket
option will be added through another patch.
[qed_sp_iscsi_func_start:189(host_7-0)]Cannot satisfy CQ amount. Queues
requested 8, CQs available 4. Aborting function start
The above condition will be resolved, as the management firmware is capable
of telling us the number of CQs available for a given PF. qed will
communicate that number to qedi, so that qedi knows how many CQs
are allowed.
Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Trivial fix to spelling mistake in QEDF_ERR message. I should have also
included this in a previous fix, but I only just spotted this one.
Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Manish Rangankar <Manish.Rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
qedi uses iscsi_boot_sysfs to export the targets used for boot to
sysfs. Select the config option to make sure the module is built.
This addresses the compile time issue,
drivers/scsi/qedi/qedi_main.o: In function `qedi_remove':
qedi_main.c:(.text+0x3bbd): undefined reference to `iscsi_boot_destroy_kset'
drivers/scsi/qedi/qedi_main.o: In function `__qedi_probe.constprop.0':
qedi_main.c:(.text+0x577a): undefined reference to `iscsi_boot_create_target'
qedi_main.c:(.text+0x5807): undefined reference to `iscsi_boot_create_target'
qedi_main.c:(.text+0x587f): undefined reference to `iscsi_boot_create_initiator'
qedi_main.c:(.text+0x58f3): undefined reference to `iscsi_boot_create_ethernet'
qedi_main.c:(.text+0x5927): undefined reference to `iscsi_boot_destroy_kset'
qedi_main.c:(.text+0x5d7b): undefined reference to `iscsi_boot_create_host_kset'
[mkp: fixed whitespace]
Signed-off-by: Nilesh Javali <nilesh.javali@cavium.com> Fixes: c57ec8fb7c02 ("scsi: qedi: Add support for Boot from SAN over iSCSI offload") Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This patch adds support for Boot from SAN over iSCSI offload. The iSCSI
boot information in the NVRAM is populated under
/sys/firmware/iscsi_bootX/ using qed NVM-image reading API and further
exported to open-iscsi to perform iSCSI login enabling boot over offload
iSCSI interface in a Boot from SAN environment.
Signed-off-by: Arun Easi <arun.easi@cavium.com> Signed-off-by: Andrew Vasquez <andrew.vasquez@cavium.com> Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Nilesh Javali <nilesh.javali@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Reviewed-by: Lee Duncan <lduncan@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Variable idx is defined as u16 thus statement (idx < 0) is always false
and should be removed.
Signed-off-by: Christos Gkekas <chris.gekas@gmail.com> Acked-by: Manish Rangankar <Manish.Rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
We shouldn't be writing over the "ret" variable. It means we return
ERR_PTR(0) which is NULL and it results in a NULL dereference in the
caller.
Fixes: ace7f46ba5fd ("scsi: qedi: Add QLogic FastLinQ offload iSCSI driver framework.") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
max_fin_rt is the maximum re-transmission of FIN packets
as part of the termination flow. After reaching this value
the FW will send a single RESET.
Signed-off-by: Nilesh Javali <nilesh.javali@cavium.com> Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
munmap done by iscsiuio during a stop of the service triggers a "bad
pte" warning sometimes. munmap kernel path goes through the mmapped
pages and has a validation check for mapcount (in struct page) to be
zero or above. kzalloc, which we had used to allocate udev->ctrl, uses
slab allocations, which re-uses mapcount (union) for other purposes that
can make the mapcount look negative. Avoid all this trouble by using
one of the __get_free_pages wrappers instead of kzalloc for
udev->ctrl.
Signed-off-by: Arun Easi <arun.easi@cavium.com> Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Trivial fix to spelling mistake in DP_NOTICE message
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
This patch adds support for adding and deleting rx flow
classification rules. Using this, a user can classify RX flows
consisting of TCP/UDP 4-tuples [src_ip/dst_ip and src_port/dst_port]
to be steered to a given RX queue.
Signed-off-by: Manish Chopra <manish.chopra@cavium.com> Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
The option "h" (host order) exists for ipv4 only.
Remove the h when printing ipv6 addresses.
This led to the following smatch warnings:
drivers/net/ethernet/qlogic/qed/qed_iwarp.c:585 qed_iwarp_print_tcp_ramrod()
warn: '%pI6' can only be followed by c
drivers/net/ethernet/qlogic/qed/qed_iwarp.c:1521 qed_iwarp_print_cm_info()
warn: '%pI6' can only be followed by c
Fixes commit 456a584947d5 ("qed: iWARP CM add passive side connect")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit 91d1ae475b9833097e078c2581c9265d033cdbe4 ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
This patch takes care of active/passive disconnect flows.
Disconnect flows can be initiated remotely, in which case an async event
will arrive from the peer and be indicated to the qedr driver. These
are referred to as exceptions. When a QP is destroyed, it needs to check
that its associated ep has been closed.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit fc4c6065e661224df3db50780219ac53fee56e2b ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
This patch implements the active side connect.
Offload a connection, process MPA reply and send RTR.
In some of the common passive/active functions, the active side
will work in blocking mode.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit 4b0fdd7c8b757125ac7996617d914bbdb9e0348c ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
This patch implements the passive side connect.
It addresses pre-allocating resources, creating a connection
element upon receiving a valid SYN packet, calling the upper layer, and
implementing the accept/reject calls.
Error handling is not part of this patch.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit 456a584947d5b92d5e5a62cc68125ab5f150aa8c ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
This patch adds the ability to add and remove listeners and identify
whether the SYN packet received is intended for iWARP or not. If
a listener is not found the SYN packet is posted back to the chip.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit 65a91a6cdb868a28b919ca133c0f9d9dfd9a635a ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
iWARP handles incoming SYN packets using the ll2 interface. This patch
implements ll2 setup and teardown. Additional ll2 connections will
be used in the future which are not part of this patch series.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit b5c29ca7dab75f29a7df6e82285742f830d8ed1a ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
This patch adds iWARP support for flows that have common code
between RoCE and iWARP, such as initialization, teardown and
qp setup verbs: create, destroy, modify, query.
It introduces the iWARP specific files qed_iwarp.[ch] and
iwarp_common.h
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit 67b40dccc45ff5d488aad17114e80e00029fd854 ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
iWARP personality introduced the need for differentiating in several
places in the code whether we are RoCE, iWARP or either. This
leads to introducing new macros for querying the personality.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit c851a9dc4359c6b19722de568e9f543c1c23481c ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Once we have iWARP support, the qede portion of the qedr<->qede would
serve all the RDMA protocols - so rename the file to be appropriate
to its function.
While we're at it, we're also moving a couple of inclusions to it into
.h files and adding includes to make sure it contains all type
definitions it requires.
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit b262a06e642cfb1eeb6c2c772f76dad674ada57e ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
The p_l2_info->pp_qid_usage[] array has "p_l2_info->queues" elements so
the > here should be a >= or we write beyond the end of the array.
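
The off-by-one can be shown with a tiny model of the bounds check. The function names are hypothetical and the sketch only mirrors the comparison, not the qed code itself.

```python
def index_rejected_buggy(idx: int, num_elements: int) -> bool:
    """Original check: lets idx == num_elements through."""
    return idx > num_elements


def index_rejected_fixed(idx: int, num_elements: int) -> bool:
    """Fixed check: valid indices are 0 .. num_elements - 1."""
    return idx >= num_elements
```

For an array of 4 elements, index 4 passes the original check and would be written one slot past the end; the corrected comparison rejects it.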
Fixes: bbe3f233ec5e ("qed: Assign a unique per-queue index to queue-cid") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[ Upstream commit 0331402aeaefe858709b0a4d44ade15f82d3a119 ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>