www.infradead.org Git - users/jedix/linux-maple.git/log
users/jedix/linux-maple.git
7 years ago sched/rt: Minimize rq->lock contention in do_sched_rt_period_timer() v4.1.12-102.0.20170530_1700
Dave Kleikamp [Mon, 15 May 2017 19:14:13 +0000 (14:14 -0500)]
sched/rt: Minimize rq->lock contention in do_sched_rt_period_timer()

With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially
takes each CPU's rq->lock. On a large, busy system, the cumulative time it
takes to acquire each lock can be excessive, even triggering a watchdog
timeout.

If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does
nothing while holding the lock, so don't bother taking it at all.
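
The check described above can be sketched in userspace C as follows. This is an illustration only: struct rt_rq_sketch and rt_rq_needs_lock() are hypothetical stand-ins, not the kernel's definitions, and the sketch ignores the memory-ordering questions that an unlocked read raises in the real scheduler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct rt_rq (illustrative only). */
struct rt_rq_sketch {
    uint64_t rt_time;           /* RT runtime consumed in this period */
    unsigned int rt_nr_running; /* runnable RT tasks on this runqueue */
};

/*
 * Returns true when the period timer has work to do for this runqueue
 * and so needs rq->lock; false when taking the lock can be skipped.
 */
static bool rt_rq_needs_lock(const struct rt_rq_sketch *rt_rq)
{
    return rt_rq->rt_time != 0 || rt_rq->rt_nr_running != 0;
}
```

A caller looping over CPUs would test rt_rq_needs_lock() first and move on to the next CPU when it returns false.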

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a767637b-df85-912f-ba69-c90ee00a3fb6@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 25491970

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago sparc64: cache_line_size() returns larger value for cache line size.
chris hyser [Thu, 18 May 2017 18:18:33 +0000 (12:18 -0600)]
sparc64: cache_line_size() returns larger value for cache line size.

SPARC currently returns L1 data cache line size (as low as 32 bytes on
some systems) though L2 and L3 cache line sizes may be higher.  As
cache_line_size() is used by code to align memory requests to prevent
unnecessary cache line sharing, this patch returns the max of L2 and L3
sizes, currently 64 bytes.
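
A minimal sketch of the idea, assuming the L2 and L3 line sizes are already known as plain integers. The helper names are hypothetical and do not come from the patch; shares_line() only illustrates why aligning to a 32-byte L1 line can still leave two objects in one 64-byte L2/L3 line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: report the larger of the L2 and L3 line sizes. */
static unsigned int cache_line_size_sketch(unsigned int l2_line,
                                           unsigned int l3_line)
{
    return l2_line > l3_line ? l2_line : l3_line;
}

/* True when two addresses fall in the same cache line of the given size. */
static bool shares_line(uintptr_t a, uintptr_t b, unsigned int line)
{
    return a / line == b / line;
}
```

With 32-byte alignment, objects at offsets 0 and 32 share a 64-byte line; with 64-byte alignment (offsets 0 and 64) they do not.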

OraBug: 26045057

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago sparc64: fix inconsistent printing of handles in debug messages
Menno Lageman [Tue, 2 May 2017 10:23:00 +0000 (06:23 -0400)]
sparc64: fix inconsistent printing of handles in debug messages

Most debug messages print handles using "%llx" but some use "%llu". Use
"%llx" for all debug messages that print handles.

Signed-off-by: Menno Lageman <menno.lageman@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago sparc64: set the ISCNTRLD bit for SP service handles
Menno Lageman [Tue, 2 May 2017 09:53:53 +0000 (05:53 -0400)]
sparc64: set the ISCNTRLD bit for SP service handles

Service handles generated by the ds driver can collide with service handles
generated by the SP, causing failures with Domain Services on the SP such
as 'ldom_req_sp_token: set-token failed: no reply' errors.

Ensure that service handles generated by the ds driver do not collide
with service handles generated by the SP by setting the ISCNTRLD bit in
the lower half of the service handle for SP Domain Services. This is
similar to what Solaris does.

Orabug: 25983868

Signed-off-by: Menno Lageman <menno.lageman@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago sparc64: DAX recursive lock removed
Rob Gardner [Fri, 19 May 2017 01:14:06 +0000 (19:14 -0600)]
sparc64: DAX recursive lock removed

At some point in the past, the call to get_user_pages() was changed to
get_user_pages_fast(). The former requires that mmap_sem be held when
making the call, which the driver respected. But the latter requires that
mmap_sem not be held, since it acquires it later. So mmap_sem was being
acquired by the driver, then again in get_user_pages_fast().  In between
these two acquisitions, another thread can come along and call mmap(),
which will wait on the same semaphore, and deadlock with the subsequent
get_user_pages_fast() attempt to get it again.

  Thread 1                        Thread 2
  --------                        --------
  acquire mmap_sem                   .
  call get_user_pages_fast()         .
     .                            mmap()
     .                              acquire mmap_sem (blocks)
     acquire mmap_sem (blocks)

Since get_user_pages_fast() acquires mmap_sem, the dax driver should
not do so.

Orabug: 26103487

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Reviewed-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago sparc/ftrace: Fix ftrace graph time measurement
Liam R. Howlett [Wed, 17 May 2017 15:47:00 +0000 (11:47 -0400)]
sparc/ftrace: Fix ftrace graph time measurement

The ftrace function_graph time measurements of a given function are not
accurate compared with those recorded by ftrace using the function
filters.  This change pulls the x86_64 fix from 'commit 722b3c746953
("ftrace/graph: Trace function entry before updating index")' into the
sparc-specific prepare_ftrace_return, which stops ftrace from
counting interrupted tasks in the time measurement.

Example measurements for select_task_rq_fair running "hackbench 100
process 1000":

              |  tracing/trace_stat/function0  |  function_graph
 Before patch |  2.802 us                      |  4.255 us
 After patch  |  2.749 us                      |  3.094 us

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(Cherry picked from commit 48078d2dac0a26f84f5f3ec704f24f7c832cce14)

Note: Upstream fix needed an extra parameter of NULL for
prepare_ftrace_return.

Orabug: 25995351

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago sparc64: Increase max_phys_bits to 51 for M8.
Vijay Kumar [Thu, 20 Apr 2017 19:29:49 +0000 (13:29 -0600)]
sparc64: Increase max_phys_bits to 51 for M8.

On M8 chips, use a max_phys_bits value of 51, and 54 bits for
virtual addresses.

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago sparc64: 5-Level page table support for sparc
Vijay Kumar [Thu, 20 Apr 2017 17:00:58 +0000 (11:00 -0600)]
sparc64: 5-Level page table support for sparc

Extend the page table to 5 levels for sparc.

Orabug: 26076110
Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago mm, gup: fix typo in gup_p4d_range()
Kirill A. Shutemov [Mon, 13 Mar 2017 05:22:13 +0000 (08:22 +0300)]
mm, gup: fix typo in gup_p4d_range()

gup_p4d_range() should call gup_pud_range(), not itself.

[ This was not noticed on x86: this is the HAVE_GENERIC_RCU_GUP code
  used by arm[64] and powerpc    - Linus ]

Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reported-by: Anton Blanchard <anton@samba.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ce70df089143c49385b4f32f39d41fb50fbf6a7c)

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago mm: introduce __p4d_alloc()
Kirill A. Shutemov [Thu, 9 Mar 2017 14:24:08 +0000 (17:24 +0300)]
mm: introduce __p4d_alloc()

For full 5-level paging we need a helper to allocate p4d page table.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 90eceff1a375f6ffa78caf8654e787c0a8a591ef)

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago mm: convert generic code to 5-level paging
Vijay Kumar [Thu, 20 Apr 2017 00:11:00 +0000 (18:11 -0600)]
mm: convert generic code to 5-level paging

Convert all non-architecture-specific code to 5-level paging.

It's mostly a mechanical change that adds handling of one more page table
level in places where we deal with pud_t.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c2febafc67734a62196c1b9dfba926412d4077ba)

Conflicts:

include/linux/kasan.h
mm/kasan/kasan_init.c
mm/memory.c
mm/page_vma_mapped.c

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago asm-generic: introduce <asm-generic/pgtable-nop4d.h>
Vijay Kumar [Wed, 19 Apr 2017 22:03:45 +0000 (16:03 -0600)]
asm-generic: introduce <asm-generic/pgtable-nop4d.h>

Like with pgtable-nopud.h for 4-level paging, this new header is the base
for converting an architecture to a properly folded p4d_t level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 048456dcf2c56ad6f6248e2899dda92fb6a613f6)

Conflicts:

include/asm-generic/tlb.h

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago arch, mm: convert all architectures to use 5level-fixup.h
Vijay Kumar [Wed, 19 Apr 2017 21:59:24 +0000 (15:59 -0600)]
arch, mm: convert all architectures to use 5level-fixup.h

If an architecture uses 4level-fixup.h we don't need to do anything as
it includes 5level-fixup.h.

If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before inclusion of the header. This makes the asm-generic code use
5level-fixup.h.

If an architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 9849a5697d3defb2087cb6b9be5573a142697889)

Conflicts:

arch/arc/include/asm/hugepage.h
arch/h8300/include/asm/pgtable.h
arch/mips/include/asm/pgtable-64.h
arch/powerpc/include/asm/book3s/32/pgtable.h
arch/powerpc/include/asm/book3s/64/pgtable.h
arch/powerpc/include/asm/nohash/32/pgtable.h
arch/powerpc/include/asm/nohash/64/pgtable-4k.h
arch/powerpc/include/asm/nohash/64/pgtable-64k.h

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago asm-generic: introduce __ARCH_USE_5LEVEL_HACK
Kirill A. Shutemov [Thu, 9 Mar 2017 14:24:04 +0000 (17:24 +0300)]
asm-generic: introduce __ARCH_USE_5LEVEL_HACK

We are going to introduce <asm-generic/pgtable-nop4d.h> to provide an
abstraction for a properly folded p4d level (as opposed to the
5level-fixup.h hack). The new header will be included from pgtable-nopud.h.

If an architecture uses <asm-generic/nop*d.h>, we cannot use
5level-fixup.h directly to quickly convert the architecture to 5-level
paging as it would conflict with pgtable-nop4d.h.

With this patch an architecture can define __ARCH_USE_5LEVEL_HACK before
including <asm-generic/nop*d.h> to use 5level-fixup.h.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 30ec842660bd0d056d4a7028ac5bd4a82b113d4f)

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago asm-generic: introduce 5level-fixup.h
Kirill A. Shutemov [Thu, 9 Mar 2017 14:24:03 +0000 (17:24 +0300)]
asm-generic: introduce 5level-fixup.h

We are going to switch core MM to the 5-level paging abstraction.

This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header makes it possible to quickly make
all architectures compatible with 5-level paging in core MM.

In the long run we would like to switch architectures to a properly folded
p4d level by using <asm-generic/pgtable-nop4d.h>, but that requires more
changes to arch-specific code.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 505a60e225606fbd3d2eadc31ff793d939ba66f1)

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
7 years ago sparc64: prevent sunvdc from sending duplicate vdisk requests
Jag Raman [Tue, 2 May 2017 19:49:34 +0000 (15:49 -0400)]
sparc64: prevent sunvdc from sending duplicate vdisk requests

Prevent sunvdc from sending duplicate vdisk requests by ensuring that
inflight vdisk requests are resent before waking up suspended vdisk
threads.

Orabug: 25866770

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago ldmvsw: stop the clean timer at beginning of remove
Shannon Nelson [Mon, 15 May 2017 15:33:06 +0000 (08:33 -0700)]
ldmvsw: stop the clean timer at beginning of remove

Stop the clean timer earlier to be sure there's no asynchronous
interference while stopping the port.

Orabug: 25748241

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago sparc64: set CONFIG_EFI in config
Eric Snowberg [Thu, 11 May 2017 00:08:56 +0000 (17:08 -0700)]
sparc64: set CONFIG_EFI in config

Orabug: 26037358

Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago sparc64: /sys/firmware/efi missing during EFI boot
Eric Snowberg [Wed, 10 May 2017 14:50:11 +0000 (07:50 -0700)]
sparc64: /sys/firmware/efi missing during EFI boot

The newest version of OBP is capable of doing an EFI boot.  When Linux
is booted through this EFI loader, the /sys/firmware/efi directory does
not exist.  Many userspace applications, such as GRUB, check whether
the directory /sys/firmware/efi exists; if it does, they take it to mean
that the kernel has booted in EFI mode.

A new Open Firmware property called efi-booter has been added
to /chosen. This new property is only present when doing an
EFI boot.

Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Reviewed-by: Thomas Tai <thomas.tai@oracle.com>

Orabug: 26037358
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago Allow default value of npools used for iommu to be configured from cmdline
Allen Pais [Fri, 2 Dec 2016 08:01:47 +0000 (13:31 +0530)]
Allow default value of npools used for iommu to be configured from cmdline

    The default value of the number of pools used by the pooled IOMMU
    allocator in lib/iommu-common.c is a constant today (set at 16).
    It is possible that, for some platforms and some devices, the combination
    of latency and frequency of iommu alloc/free requests may be such
    as to trigger fragmentation within a pool, leading to iommu alloc failure.

    Reducing the number of pools (and thus increasing the pool size) can
    minimize the risk of those failures.

    This patch provides a command line hook to set the default number of
    pools at boot time.

 Ported to UEK4

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago SPARC64: Add Linux vds driver Device ID support for Solaris guest boot
George Kennedy [Mon, 15 May 2017 14:43:56 +0000 (07:43 -0700)]
SPARC64: Add Linux vds driver Device ID support for Solaris guest boot

Currently, Solaris guest backend disk images cannot be moved from the Device ID
they were created at and still boot. This bug fix adds Solaris Device ID
support to the Linux vds driver to allow a Solaris guest backend disk image to
be moved to a different device ID from where it was created and still boot.

The Linux vds driver support added in this bug fix is for Solaris disk images
only. In the future, Solaris Device ID support for physical disk backends will
be added to the Linux vds driver as well.

From PSARC/1995/352:
Solaris Device IDs provide a means for identifying a device, independent of the
device's current name or device number. The instance number of a device number
may change across reconfiguration boots, changing the device number (dev_t) for
that device.  Operator errors in recabling can cause devices to swap logical
device names, introducing the potential for data loss.

Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Reviewed-by: Alexandre Chartre <Alexandre.Chartre@oracle.com>
Orabug: 25836231
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago sparc64: Remove locking of huge pages in DAX driver
Sanath Kumar [Mon, 15 May 2017 16:31:34 +0000 (11:31 -0500)]
sparc64: Remove locking of huge pages in DAX driver

Orabug: 25968141

Some huge page virtual addresses do not work with get_user_pages. Since
the purpose of calling get_user_pages is for its locking side effect, it
is not at all necessary for huge pages since they are permanently
pinned. So the failure is avoided and the unnecessary locking/unlocking
is eliminated.

Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago ldmvsw: unregistering netdev before disable hardware
Thomas Tai [Mon, 8 May 2017 20:37:40 +0000 (13:37 -0700)]
ldmvsw: unregistering netdev before disable hardware

When running an LDom binding/unbinding test, the kernel may panic
in ldmvsw_open(). The most likely cause is that, because we remove
the ldc connection before unregistering the netdev in vsw_port_remove(),
we open a window of time in which one process can be removing the
device while another is trying to bring the device up. This also sometimes
causes a vio handshake error due to opening a device without closing it
completely. We should unregister the netdev before we disable the "hardware".

orabug: 2598091325925306

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago arch/sparc: Measure receiver forward progress to avoid send mondo timeout
Jane Chu [Wed, 15 Mar 2017 21:58:46 +0000 (14:58 -0700)]
arch/sparc: Measure receiver forward progress to avoid send mondo timeout

A large sun4v SPARC system may have moments of intensive xcall activities,
usually caused by unmapping many pages on many CPUs concurrently. This can
flood receivers with CPU mondo interrupts for an extended period, causing
some unlucky senders to hit send-mondo timeout. This problem gets worse
as cpu count increases because sometimes mappings must be invalidated on
all CPUs, and sometimes all CPUs may gang up on a single CPU.

But a busy system is not a broken system. In the above scenario, as long
as the receiver is making forward progress processing mondo interrupts,
the sender should continue to retry.

This patch implements the receiver's forward progress meter by introducing
a per-cpu counter 'cpu_mondo_counter[cpu]' where 'cpu' is in the range
0..NR_CPUS-1. The receiver increments its counter as soon as it receives
a mondo and the sender tracks the receiver's counter. Every 10000 retries,
if the receiver has stopped making forward progress, the sender declares
a send-mondo timeout and panics; otherwise, the sender keeps retrying
while the receiver continues to make forward progress.
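
The sender-side check can be sketched in userspace C as below. This is a simplified model, not the patch's code: struct sender_state and mondo_retry_step() are hypothetical names, and only the 10000-retry interval comes from the description above.

```c
#include <assert.h>

#define MONDO_RETRY_CHECK 10000UL /* check interval named in the commit text */

enum mondo_verdict { MONDO_RETRY, MONDO_TIMEOUT };

/* Hypothetical per-send bookkeeping kept by the sender. */
struct sender_state {
    unsigned long retries;   /* retries made so far */
    unsigned long last_seen; /* receiver counter at the last checkpoint */
};

/*
 * Called once per failed send attempt with the receiver's current
 * counter value. Between checkpoints the sender simply retries; at each
 * checkpoint it gives up only if the counter has not moved.
 */
static enum mondo_verdict mondo_retry_step(struct sender_state *s,
                                           unsigned long receiver_counter)
{
    s->retries++;
    if (s->retries % MONDO_RETRY_CHECK != 0)
        return MONDO_RETRY;
    if (receiver_counter == s->last_seen)
        return MONDO_TIMEOUT; /* no forward progress since last check */
    s->last_seen = receiver_counter;
    return MONDO_RETRY;
}
```

The point of the design is that a busy receiver never causes a timeout by itself; only a receiver whose counter stays unchanged across a full checkpoint interval does.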

Orabug: 25476541
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-By: Steve Sistare <steven.sistare@oracle.com>
Reviewed-By: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago sparc64: update DAX submit to latest HV spec
Jonathan Helman [Mon, 1 May 2017 18:47:15 +0000 (11:47 -0700)]
sparc64: update DAX submit to latest HV spec

Orabug: 25927558

DAX submit needs to be updated to the latest HV spec. Along with a couple
small updates, the biggest modification is changing nomap_va to
status_data. This is mostly a cosmetic change but also adds support to
return the unavailable code via the exec ioctl. Further, augment the
comments and fix up a couple nits in the ccb submit hcall in hypervisor.h.

Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago arch/sparc: increase CONFIG_NODES_SHIFT on SPARC to 5
Jane Chu [Thu, 30 Mar 2017 18:04:40 +0000 (12:04 -0600)]
arch/sparc: increase CONFIG_NODES_SHIFT on SPARC to 5

The SPARC M6-32 platform has 32 (2^5) NUMA nodes, so we need to bump
CONFIG_NODES_SHIFT up to 5.

Orabug: 25577754

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago arch/sparc: support NR_CPUS = 4096
jane Chu [Wed, 22 Mar 2017 22:49:05 +0000 (16:49 -0600)]
arch/sparc: support NR_CPUS = 4096

Linux SPARC64 limits NR_CPUS to 4064 because init_cpu_send_mondo_info()
only allocates a single page for NR_CPUS mondo entries. Thus we cannot
use all 4096 CPUs on some SPARC platforms.

To fix, allocate (2^order) pages where order is set according to the size
of cpu_list for possible cpus. Since cpu_list_pa and cpu_mondo_block_pa
are not used in asm code, there are no imm13 offsets from the base PA
that will break because they can only reach one page.
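
One way to read the 4064 figure and the order calculation is sketched below. The 8 KB page size, the 64-byte mondo block sharing the allocation, and the 16-bit CPU list entries are assumptions made for this illustration; none of them is stated in the commit message, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH  8192UL /* assumption: 8 KB base page */
#define MONDO_BLOCK_BYTES 64UL   /* assumption: mondo block shares the area */

/* Bytes needed for the mondo block plus one 16-bit entry per CPU. */
static unsigned long mondo_area_bytes(unsigned long ncpus)
{
    return MONDO_BLOCK_BYTES + ncpus * sizeof(uint16_t);
}

/* Smallest order such that (PAGE_SIZE_SKETCH << order) covers 'bytes'. */
static unsigned int order_for_bytes(unsigned long bytes)
{
    unsigned int order = 0;

    while ((PAGE_SIZE_SKETCH << order) < bytes)
        order++;
    return order;
}
```

Under these assumptions 4064 CPUs fill exactly one page (order 0), while 4096 CPUs need two pages (order 1).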

Orabug: 25505750

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago ipv6: catch a null skb before using it in a DTRACE
Shannon Nelson [Wed, 3 May 2017 00:17:36 +0000 (17:17 -0700)]
ipv6: catch a null skb before using it in a DTRACE

Fix a little trap set by an earlier DTRACE_IP patch.  While I was there
I checked the other similar calls and the rest look okay.

Orabug: 25973797

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-By: Jane Chu <jane.chu@oracle.com>
Reviewed-By: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago sparc64: fix fault handling in NGbzero.S and GENbzero.S
Dave Aldridge [Thu, 27 Apr 2017 09:20:18 +0000 (03:20 -0600)]
sparc64: fix fault handling in NGbzero.S and GENbzero.S

When any of the functions contained in NGbzero.S and GENbzero.S
are being run, we may end up taking a fault when executing one
of the store alternate address space instructions. If this
happens, the exception handler does not restore the %asi
register.

This commit fixes the issue by introducing a new exception
handler that ensures the %asi register is restored when
a fault is handled.

Orabug: 25577560

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago sparc64: modify sys_dax.h for new libdax
Jonathan Helman [Thu, 27 Apr 2017 16:11:10 +0000 (09:11 -0700)]
sparc64: modify sys_dax.h for new libdax

Orabug: 25927572

Modify sys_dax.h such that new libdax can be compiled by including this
file unmodified. Userspace does not have u16, u32, etc. types defined and
as stated in Section 5e of Documentation/CodingStyle, we should be using
__u16, __u32, etc. in the ioctl structures which are exported to userspace.

Further, rename the DAXIOC_DEP_[number] ioctls and use DAXIOC_[name]_OLD
instead.

Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago bnx2x: Align RX buffers
Scott Wood [Sat, 29 Apr 2017 00:17:41 +0000 (19:17 -0500)]
bnx2x: Align RX buffers

The bnx2x driver is not providing proper alignment on the receive buffers it
passes to build_skb(), causing skb_shared_info to be misaligned.
skb_shared_info contains an atomic, and while PPC normally supports
unaligned accesses, it does not support unaligned atomics.

Aligning the size of rx buffers will ensure that page_frag_alloc() returns
aligned addresses.
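
Rounding a buffer size up to an alignment boundary can be sketched as follows. The helper align_up() is hypothetical, and the 64-byte boundary used in the usage note is an assumption for illustration; this log does not show the constant the driver actually uses.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round 'size' up to the next multiple of 'align'.
 * 'align' must be a non-zero power of two.
 */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

For example, align_up(1450, 64) yields 1472, so consecutive buffers carved from the same allocation start on aligned addresses even when the MTU-derived size is not a multiple of 4.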

This can be reproduced on PPC by setting the network MTU to 1450 (or other
non-multiple-of-4) and then generating sufficient inbound network traffic
(one or two large "wget"s usually does it), producing the following oops:

Unable to handle kernel paging request for unaligned access at address 0xc00000ffc43af656
Faulting instruction address: 0xc00000000080ef8c
Oops: Kernel access of bad area, sig: 7 [#1]
SMP NR_CPUS=2048
NUMA
PowerNV
Modules linked in: vmx_crypto powernv_rng rng_core powernv_op_panel leds_powernv led_class nfsd ip_tables x_tables autofs4 xfs lpfc bnx2x mdio libcrc32c crc_t10dif crct10dif_generic crct10dif_common
CPU: 104 PID: 0 Comm: swapper/104 Not tainted 4.11.0-rc8-00088-g4c761da #2
task: c00000ffd4892400 task.stack: c00000ffd4920000
NIP: c00000000080ef8c LR: c00000000080eee8 CTR: c0000000001f8320
REGS: c00000ffffc33710 TRAP: 0600   Not tainted  (4.11.0-rc8-00088-g4c761da)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
  CR: 24082042  XER: 00000000
CFAR: c00000000080eea0 DAR: c00000ffc43af656 DSISR: 00000000 SOFTE: 1
GPR00: c000000000907f64 c00000ffffc33990 c000000000dd3b00 c00000ffcaf22100
GPR04: c00000ffcaf22e00 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000b80008 c00000ffc43af636 c00000ffc43af656 0000000000000000
GPR12: c0000000001f6f00 c00000000fe1a000 000000000000049f 000000000000c51f
GPR16: 00000000ffffef33 0000000000000000 0000000000008a43 0000000000000001
GPR20: c00000ffc58a90c0 0000000000000000 000000000000dd86 0000000000000000
GPR24: c000007fd0ed10c0 00000000ffffffff 0000000000000158 000000000000014a
GPR28: c00000ffc43af010 c00000ffc9144000 c00000ffcaf22e00 c00000ffcaf22100
NIP [c00000000080ef8c] __skb_clone+0xdc/0x140
LR [c00000000080eee8] __skb_clone+0x38/0x140
Call Trace:
[c00000ffffc33990] [c00000000080fb74] skb_clone+0x74/0x110 (unreliable)
[c00000ffffc339c0] [c000000000907f64] packet_rcv+0x144/0x510
[c00000ffffc33a40] [c000000000827b64] __netif_receive_skb_core+0x5b4/0xd80
[c00000ffffc33b00] [c00000000082b2bc] netif_receive_skb_internal+0x2c/0xc0
[c00000ffffc33b40] [c00000000082c49c] napi_gro_receive+0x11c/0x260
[c00000ffffc33b80] [d000000066483d68] bnx2x_poll+0xcf8/0x17b0 [bnx2x]
[c00000ffffc33d00] [c00000000082babc] net_rx_action+0x31c/0x480
[c00000ffffc33e10] [c0000000000d5a44] __do_softirq+0x164/0x3d0
[c00000ffffc33f00] [c0000000000d60a8] irq_exit+0x108/0x120
[c00000ffffc33f20] [c000000000015b98] __do_irq+0x98/0x200
[c00000ffffc33f90] [c000000000027f14] call_do_irq+0x14/0x24
[c00000ffd4923a90] [c000000000015d94] do_IRQ+0x94/0x110
[c00000ffd4923ae0] [c000000000008d90] hardware_interrupt_common+0x150/0x160

Orabug: 25806778
Cherry-picked from 05c0d69d7

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago PCI: Fix unaligned accesses in VC code
David Miller [Sun, 19 Jun 2016 06:52:25 +0000 (23:52 -0700)]
PCI: Fix unaligned accesses in VC code

The save/restore buffer for VC state is composed first of a 2-byte control
register, then a bunch of 4-byte words.

This causes unaligned accesses which trap on platforms such as sparc.

This is easy to fix by simply moving the buffer pointer forward by 4 bytes
instead of 2 after dealing with the control register.  The length
adjustment needs to be changed likewise.
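
The effect of the pointer adjustment on alignment can be checked with a small sketch. It assumes the buffer itself starts on a 4-byte boundary; the helper names are hypothetical and the code models offsets only, not the PCI save/restore routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the n-th 32-bit word when 'ctrl_skip' bytes precede the words. */
static size_t word_offset(size_t ctrl_skip, size_t n)
{
    return ctrl_skip + n * sizeof(uint32_t);
}

/* True when an offset from a 4-byte-aligned base is itself 4-byte aligned. */
static bool word_is_aligned(size_t offset)
{
    return offset % sizeof(uint32_t) == 0;
}
```

Skipping 2 bytes leaves every following word at an offset of the form 4n + 2, which is misaligned; skipping 4 bytes keeps them all aligned.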

Orabug: 25806778
Cherry-picked from b77b3610 PCI: Fix unaligned accesses in VC code

Fixes: 5f8fc43217a0 ("PCI: Include pci/pcie/Kconfig directly from pci/Kconfig")
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: stable@vger.kernel.org # v4.6+
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL
Daniel Jordan [Fri, 28 Apr 2017 15:49:21 +0000 (08:49 -0700)]
sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL

Orabug: 25830041

(Cherry-pick of upstream 395102db441abb8fd18fec5dd81428b5120232af)

CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
kernel text, data, and bss fit in the locked TLB entries allotted for
the kernel, but this option is not set for every config that enables
lockdep.

A 4.10 kernel fails to boot with the console output

    Kernel: Using 8 locked TLB entries for main kernel image.
    hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
    Program terminated

with these config options

    CONFIG_LOCKDEP=y
    CONFIG_LOCK_STAT=y
    CONFIG_PROVE_LOCKING=n

To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
usage every time lockdep is turned on.

Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
CONFIG_LOCKDEP is set to 'y'.  When other lockdep-related config options
that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
enabled.

Fixes: 64740b06b7e5 ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years ago lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined
Babu Moger [Tue, 27 Sep 2016 17:47:17 +0000 (10:47 -0700)]
lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined

Reduce the size of data structure for lockdep entries by half if
PROVE_LOCKING_SMALL is defined. This is used only for sparc.

Orabug: 24736954

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
7 years ago config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc
Babu Moger [Tue, 27 Sep 2016 17:05:34 +0000 (10:05 -0700)]
config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc

This new config parameter limits the space used for "Lock debugging:
prove locking correctness" by about 4MB. Current sparc systems limit
the kernel size, including the .text, .data and .bss sections, to 32MB.
With the PROVE_LOCKING feature, the kernel size could grow beyond this
limit, causing system boot-up issues. With this option, the kernel limits
the size of the entries of lock_chains, stack_trace etc. so that the
kernel fits in the required size limit. This option is not visible to
the user and is only used for sparc.

Orabug: 24736954

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
7 years ago sparc64: fix cdev_put() use-after-free when unbinding an LDom
Thomas Tai [Thu, 27 Apr 2017 17:51:48 +0000 (10:51 -0700)]
sparc64: fix cdev_put() use-after-free when unbinding an LDom

After turning on slub_debug=P kernel option, a kernel panic happens when
unbinding an LDom. This suggests that there is memory corruption.
The memory corruption is caused by vlds_fops_release() freeing a memory
structure containing a cdev. The cdev is needed by fs/file_table.c
after the file is released.

The common approach to solving this issue is to add a kobject member
to the structure and set it to be the parent of the cdev. The kobject is
then responsible for freeing the structure when the reference count
drops to zero. The solution is based on the following patch.
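
The ownership pattern described above can be sketched in plain C. This
is a userspace illustration under stated assumptions, not the driver
code: struct refcounted stands in for a kobject, the counter is not
atomic, and every name here is hypothetical.

```c
#include <stdlib.h>

/* Stand-in for a kobject: a reference count plus a release callback. */
struct refcounted {
	int refs;
	void (*release)(struct refcounted *self);
};

/* Hypothetical device structure. The refcounted member is placed first
 * so a pointer to it is also a pointer to the containing structure. */
struct vlds_dev_sketch {
	struct refcounted ref;
	int id;
};

static int released_count;	/* lets a caller observe when the free happened */

static void dev_release(struct refcounted *r)
{
	free((struct vlds_dev_sketch *)r);
	released_count++;
}

static struct vlds_dev_sketch *dev_alloc(int id)
{
	struct vlds_dev_sketch *d = calloc(1, sizeof(*d));

	if (!d)
		return NULL;
	d->ref.refs = 1;		/* reference owned by the creator */
	d->ref.release = dev_release;
	d->id = id;
	return d;
}

static void ref_get(struct refcounted *r)
{
	r->refs++;
}

/* The structure is freed only when the last holder drops its reference,
 * so an embedded member stays valid for late users. */
static void ref_put(struct refcounted *r)
{
	if (--r->refs == 0)
		r->release(r);
}
```

With this shape, the file release path only drops its own reference;
whoever still needs the embedded object keeps the memory alive until
its own put.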

https://patchwork.kernel.org/patch/8985881/

Orabug: 25911389

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-By: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Tom Saeger <tom.saeger@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years agosparc64: change DAX CCB_EXEC ENOBUFS print to debug
Jonathan Helman [Fri, 21 Apr 2017 00:45:56 +0000 (17:45 -0700)]
sparc64: change DAX CCB_EXEC ENOBUFS print to debug

Orabug: 25927528

The CCB_EXEC ioctl in the DAX driver returns ENOBUFS when the user must
free completion areas before the submission can succeed. There is a
dax_err() print when this condition occurs. This print should be changed to
a dax_dbg() print since this return value can be used by the caller to
trigger freeing the completion areas, hence an error print is too verbose.

Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
7 years agoDrivers: hv: kvp: fix IP Failover
Vitaly Kuznetsov [Sun, 1 May 2016 02:21:33 +0000 (19:21 -0700)]
Drivers: hv: kvp: fix IP Failover

Hyper-V VMs can be replicated to other hosts, and there is a feature to
set a different IP for replicas, called 'Failover TCP/IP'. When such a
guest starts, the Hyper-V host sends it a KVP_OP_SET_IP_INFO message as
soon as we finish the negotiation procedure. The problem is that this can
happen (and it actually does) before the userspace daemon connects, and we
reply with HV_E_FAIL to the message. As there are no repetitions, we fail
to set the requested IP.

Solve the issue by postponing our reply to the negotiation message until
the userspace daemon is connected. We can't wait too long as there is a
host-side timeout (approx. 75 seconds) and if we fail to reply in this time
frame the whole KVP service will become inactive. The solution is not
ideal: if it takes the userspace daemon more than 60 seconds to connect,
IP Failover will still fail, but I don't see a solution with our current
separation between kernel and userspace parts.

The other two modules (VSS and FCOPY) don't require such a delay, so leave
them untouched.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit 4dbfc2e68004c60edab7e8fd26784383dd3ee9bc)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoDrivers: hv: util: Pass the channel information during the init call
K. Y. Srinivasan [Fri, 26 Feb 2016 23:13:19 +0000 (15:13 -0800)]
Drivers: hv: util: Pass the channel information during the init call

Pass the channel information to the util drivers that need to defer
reading the channel while they are processing a request. This would address
the following issue reported by Vitaly:

Commit 3cace4a61610 ("Drivers: hv: utils: run polling callback always in
interrupt context") removed direct *_transaction.state = HVUTIL_READY
assignments from *_handle_handshake() functions introducing the following
race: if a userspace daemon connects before we get first non-negotiation
request from the server hv_poll_channel() won't set transaction state to
HVUTIL_READY as (!channel) condition will fail, we set it to non-NULL on
the first real request from the server.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit b9830d120cbe155863399f25eaef6aa8353e767f)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoDrivers: hv: utils: run polling callback always in interrupt context
Olaf Hering [Tue, 15 Dec 2015 00:01:33 +0000 (16:01 -0800)]
Drivers: hv: utils: run polling callback always in interrupt context

All channel interrupts are bound to specific VCPUs in the guest
at the point the channel is created. While we currently invoke the
polling function on the correct CPU (the CPU to which the channel
is bound), in some cases we may run the polling function in
a non-interrupt context. This can potentially cause an issue, as the
polling function can be interrupted by the channel callback function.
Fix the issue by running the polling function on the appropriate CPU
at interrupt level. Additional details of the issue being addressed by
this patch are given below:

Currently hv_fcopy_onchannelcallback is called from interrupts and also
via the ->write function of hv_utils. Since the global variables used to
maintain state are not thread safe, the state can get out of sync.
This affects the variable state as well as the channel inbound buffer.

As suggested by KY adjust hv_poll_channel to always run the given
callback on the cpu which the channel is bound to. This avoids the need
for locking because all the util services are single threaded and only
one transaction is active at any given point in time.

Additionally, remove the context variable; it will always be the same as
recv_channel.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit 3cace4a616108539e2730f8dc21a636474395e0f)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoDrivers: hv: util: Increase the timeout for util services
K. Y. Srinivasan [Tue, 15 Dec 2015 00:01:32 +0000 (16:01 -0800)]
Drivers: hv: util: Increase the timeout for util services

Util services such as KVP and FCOPY need assistance from daemons running
in user space. Increase the timeout so we don't prematurely terminate
the transaction in the kernel. The host sets up a 60 second timeout for
all util driver transactions. The host will retry the transaction if it
times out. Set the guest timeout at 30 seconds.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit c0b200cfb0403740171c7527b3ac71d03f82947a)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoDrivers: hv: kvp: check kzalloc return value
Vitaly Kuznetsov [Sat, 1 Aug 2015 23:08:11 +0000 (16:08 -0700)]
Drivers: hv: kvp: check kzalloc return value

kzalloc() return value check was accidentally lost in 11bc3a5fa91f:
"Drivers: hv: kvp: convert to hv_utils_transport" commit.

We don't need to reset kvp_transaction.state here as we have the
kvp_timeout_func() timeout function and in case we're in OOM situation
it is preferable to wait.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit b36fda339729a974a8838978dcdc581d8ce68fd9)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoDrivers: hv: fcopy: dynamically allocate smsg_out in fcopy_send_data()
Vitaly Kuznetsov [Sat, 1 Aug 2015 23:08:12 +0000 (16:08 -0700)]
Drivers: hv: fcopy: dynamically allocate smsg_out in fcopy_send_data()

struct hv_start_fcopy is too big to be on stack on i386, the following
warning is reported:

>> drivers/hv/hv_fcopy.c:159:1: warning: the frame size of 1088 bytes is larger than 1024 bytes [-Wframe-larger-than=]

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit 25ef06fe27a292ad33155045ef7a123be4c0b6ab)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoDrivers: hv: vss: full handshake support
Vitaly Kuznetsov [Sun, 12 Apr 2015 01:07:57 +0000 (18:07 -0700)]
Drivers: hv: vss: full handshake support

Introduce VSS_OP_REGISTER1 to support kernel replying to the negotiation
message with its own version.

Add small change to vss_handle_handshake for RH compatibility

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit cd8dc0548511efff7a97d978f989ce67a883f9a5)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoRDS/IB: 4KB receive buffers get posted by mistake on 16KB frag connections. v4.1.12-102.0.20170529_2200 v4.1.12-102.0.20170530_1300
Venkat Venkatsubra [Mon, 8 May 2017 11:23:13 +0000 (04:23 -0700)]
RDS/IB: 4KB receive buffers get posted by mistake on 16KB frag connections.

When connections using 4KB fragments move to 16KB frags
(for example during a uek2 to uek4 upgrade), we see 4KB buffers getting
posted on 16KB connections. This happens because the 4KB buffers
(buffers from the previous connection, before the move to 16KB) are
added back to the current connection's (16KB) cache.

We fix this by doing the following.

1) When the recv buffers get freed/released after either the application
   is done reading it or the socket gets closed (process dies, etc.)
   and RDS/IB decides to add that buffer back into the current cache,
   make sure the frag size matches with that of the current connection.

2) When recv completion reports IB_WC_LOC_LEN_ERR, mark the connection state
   as "buffers need to be rebuilt during reconnection". And at the time of
   reconnect rebuild the cache even though the "frag size of the connection"
   has not changed.
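
The two steps above can be sketched as follows. This is a simplified
model, not the RDS/IB code: the structures, field names and helper
functions are hypothetical, and the cache is reduced to a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection state. */
struct conn_sketch {
	size_t frag_size;	/* fragment size of the current connection */
	size_t cached;		/* number of buffers in the recycle cache */
	bool rebuild_cache;	/* set when a length error was reported */
};

/* Hypothetical receive buffer that remembers the size it was built with. */
struct frag_sketch {
	size_t size;
};

/* Step 1: only recycle a buffer whose size matches the connection.
 * Returns false when the caller must free the buffer instead. */
static bool recycle_frag(struct conn_sketch *c, const struct frag_sketch *f)
{
	if (f->size != c->frag_size)
		return false;
	c->cached++;
	return true;
}

/* Step 2a: a receive completion reported a local length error. */
static void on_recv_len_error(struct conn_sketch *c)
{
	c->rebuild_cache = true;
}

/* Step 2b: on reconnect, rebuild if flagged, even with an unchanged size. */
static bool must_rebuild_cache(const struct conn_sketch *c, size_t new_frag_size)
{
	return c->rebuild_cache || new_frag_size != c->frag_size;
}
```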

Orabug: 25920916

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
7 years agomlx4: limit max MSIX allocations
Ajaykumar Hotchandani [Fri, 5 May 2017 19:08:32 +0000 (12:08 -0700)]
mlx4: limit max MSIX allocations

We get more than 64 MSI-X vectors from CX3 firmware 2.35.5530 onwards.
This results in legacy mode EQ allocs after 64 EQs, which ends up
flooding 3 vectors and causing performance degradation.

With this patch, we limit max vector allocations to MAX_MSIX (64).
When the Mellanox driver can support more EQs without getting into legacy
mode, this patch should go away.

Orabug: 25912737

Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
7 years agosched/wait: Fix the signal handling fix
Peter Zijlstra [Sun, 13 Dec 2015 21:11:16 +0000 (22:11 +0100)]
sched/wait: Fix the signal handling fix

Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/

His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().

We cannot use current->state as was done previously, because it can
change as early as the instruction after the store to that variable.  We
must instead pass the initial state along and use that.
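
A minimal sketch of the approach, assuming a simplified helper: the wait
mode is passed in as an argument and the signal check is made against
that argument rather than against mutable task state. The function name
and the boolean signal flag are hypothetical; the constants only mirror
the kernel names.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative wait modes named after the kernel's task states. */
#define TASK_INTERRUPTIBLE	1
#define TASK_UNINTERRUPTIBLE	2

/* Hypothetical bit-wait action. The decision to return -EINTR depends
 * on the mode the caller originally requested, so an uninterruptible
 * wait can never report an interruption. */
static int bit_wait_sketch(int mode, bool signal_is_pending)
{
	/* a real helper would call schedule() here */
	if (mode == TASK_INTERRUPTIBLE && signal_is_pending)
		return -EINTR;
	return 0;
}
```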

Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers")
Reported-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Chris Mason <clm@fb.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: tglx@linutronix.de
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: hpa@zytor.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 25908266
(cherry picked from commit dfd01f026058a59a513f8a365b439a0681b803af)
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
7 years agosparc64: Fix mapping of 64k pages with MAP_FIXED
Nitin Gupta [Mon, 15 May 2017 22:40:51 +0000 (15:40 -0700)]
sparc64: Fix mapping of 64k pages with MAP_FIXED

An incorrect huge page alignment check caused
mmap failure for 64K pages when MAP_FIXED is used
with address not aligned to HPAGE_SIZE.

Orabug: 25885991

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
7 years agoudp: properly support MSG_PEEK with truncated buffers
Eric Dumazet [Wed, 30 Dec 2015 13:51:12 +0000 (08:51 -0500)]
udp: properly support MSG_PEEK with truncated buffers

Backport of this upstream commit into stable kernels :
89c22d8c3b27 ("net: Fix skb csum races when peeking")
exposed a bug in udp stack vs MSG_PEEK support, when user provides
a buffer smaller than skb payload.

In this case,
skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
                                 msg->msg_iov);
returns -EFAULT.

This bug does not happen in upstream kernels since Al Viro did a great
job of replacing this with:
skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);
This variant is safe vs short buffers.

For the time being, instead of reverting Herbert Xu's patch and adding
back the invalid skb->ip_summed changes, simply store the result of
udp_lib_checksum_complete() so that we avoid computing the checksum a
second time, and avoid the problematic
skb_copy_and_csum_datagram_iovec() call.

This patch can be applied on recent kernels as it avoids a double
checksumming, then backported to stable kernels as a bug fix.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 197c949e7798fbf28cfadc69d9ca0c2abbf93191)

Orabug: 25876402
CVE: CVE-2016-10229
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
7 years agonet/mlx4_core: panic the system on unrecoverable errors
Santosh Shilimkar [Wed, 7 Dec 2016 23:06:59 +0000 (15:06 -0800)]
net/mlx4_core: panic the system on unrecoverable errors

Mellanox catastrophic error recovery after a device reset doesn't work and
in fact leaves the node unusable on the IB network, since the HCA's ports
go down. At times a hard reset is needed to get the system rebooted,
which is a real problem in a production environment. Once the
network outage is detected, the unreachable node gets evicted and rebooted
on engineered systems using reboot, so a hung reboot command is
problematic. The idea is to let the kernel panic, which lets the system
recover on its own with the necessary logs captured. There was a debate
on whether to use panic or machine restart, but it was agreed to use
panic instead of a silent reboot since that's the preferred option.

There is a Mellanox case open to investigate this issue. This is a rare
scenario, and once the underlying issue is fixed the catas error case is
expected to be avoided. The panic is limited to this error case only.

Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Orabug: 25873690
This is a change taken from QU2, it is not upstream.
(cherry picked from commit 271d694b34bd22e5632eaad41ea1d9a47f1bde3a)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoRevert "restrict /dev/mem to idle io memory ranges"
Chuck Anderson [Tue, 4 Apr 2017 21:59:51 +0000 (14:59 -0700)]
Revert "restrict /dev/mem to idle io memory ranges"

This reverts commit bf6ac7102b7207b2327a1b8259b89fd290b67412.
restrict /dev/mem to idle io memory ranges

There is an interaction with the bnx2i driver that prevents iSCSI logins.

Orabug: 25832750
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoI/O ERROR WHEN A FILE ON ACFS FILESYSTEM IS ATTACHED TO THE GUEST DOMU
Joe Jin [Thu, 6 Apr 2017 23:56:57 +0000 (07:56 +0800)]
I/O ERROR WHEN A FILE ON ACFS FILESYSTEM IS ATTACHED TO THE GUEST DOMU

Orabug: 25831471

The commit "block: loop: prepare for supporing direct IO" used old code
from lkml; sync it with the upstream code.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoxsigo: Fix spinlock release in case of error
Pradeep Gopanapalli [Tue, 14 Mar 2017 00:04:57 +0000 (17:04 -0700)]
xsigo: Fix spinlock release in case of error

Orabug: 25779803

In the forwarding table lookup function, xve_fwt_lookup(), the xve_fwt
lock is not released when an error condition occurs.
This commit fixes the bug by releasing the xve_fwt lock on error.
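
The usual shape of such a fix is a single exit path that always drops
the lock. The sketch below is a generic illustration, not the xve code:
the lock is modelled as a counter so the balance can be checked, and all
names are hypothetical.

```c
#include <stddef.h>

/* Toy lock: 'held' counts acquisitions that were not yet released. */
struct lock_sketch {
	int held;
};

static void lock_acquire(struct lock_sketch *l) { l->held++; }
static void lock_release(struct lock_sketch *l) { l->held--; }

/* Hypothetical forwarding table protected by the lock. */
struct fwt_sketch {
	struct lock_sketch lock;
	const int *entries;
	size_t n;
};

/* Returns 0 and stores the index on success, -1 on error or miss.
 * Every path, including the error path, leaves through 'out'. */
static int fwt_lookup_sketch(struct fwt_sketch *t, int key, size_t *idx)
{
	int ret = -1;
	size_t i;

	lock_acquire(&t->lock);
	if (t->entries == NULL)
		goto out;	/* error: still release the lock below */

	for (i = 0; i < t->n; i++) {
		if (t->entries[i] == key) {
			*idx = i;
			ret = 0;
			break;
		}
	}
out:
	lock_release(&t->lock);
	return ret;
}
```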

Reviewed-by: Chien Yen <chien.yen@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
7 years agomlx4_core: Add func name to common error strings to locate uniquely
Mukesh Kacker [Sun, 12 Feb 2017 00:42:56 +0000 (16:42 -0800)]
mlx4_core: Add func name to common error strings to locate uniquely

We add function names (and where needed line numbers) to
some repeated error strings so we can identify the failure
location uniquely for ease of debugging.

Orabug: 25440329

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Acked-by: Wengang Wang <wen.gang.wang@oracle.com>
7 years agoxsigo: Optimize xsvnic module parameters for UEK4
Pradeep Gopanapalli [Wed, 8 Mar 2017 00:29:21 +0000 (16:29 -0800)]
xsigo: Optimize xsvnic module parameters for UEK4

Orabug: 25779865

Enable Transmit interrupt.
Increase transmit and receive ring sizes to 2048 for
better performance.

Reviewed-by: Chien Yen <chien.yen@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
7 years agoRevert "mlx4_ib: Memory leak on Dom0 with SRIOV."
Hakon Bugge [Wed, 5 Apr 2017 11:15:00 +0000 (13:15 +0200)]
Revert "mlx4_ib: Memory leak on Dom0 with SRIOV."

This reverts commit 08ec6789a9e36fcc849a5c4e172e599233747aa5.

Commit "mlx4_ib: Memory leak on Dom0 with SRIOV" introduced an error,
that the CM message DREQ was silently dropped by the PF passive side,
if the disconnect happened more than 5 seconds after the RTU was
received.

Orabug 25829233 documents that there is memory leak in the mlx4 driver
when the DomUs are destroyed while active. But this patchset does not
influence this leak. The leak is tracked by orabug 25946511.

This commit is a first step to make the uek4 tunneling proxy equal to
upstream and thereafter fix bugs in both places.

Orabug: 25829233

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
7 years agoRevert "mlx4: avoid multiple free on id_map_ent"
Hakon Bugge [Wed, 5 Apr 2017 11:11:01 +0000 (13:11 +0200)]
Revert "mlx4: avoid multiple free on id_map_ent"

This reverts commit 1cae1a0ad2fe88499f2fd847d50b101d1985926b.

Commit "mlx4_ib: Memory leak on Dom0 with SRIOV" introduced an error,
that the CM message DREQ was silently dropped by the PF passive side,
if the disconnect happened more than 5 seconds after the RTU was
received.

In order to cleanly revert it, this dependent commit needs to be
reverted as well.

Orabug 25829233 documents that there is memory leak in the mlx4 driver
when the DomUs are destroyed while active. But this patchset does not
influence this leak. The leak is tracked by orabug 25946511.

Note that this commit also included a renaming of a variable. This
will be re-introduced in a later commit.

Orabug: 25829233

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
7 years agoDrivers: hv: vss: convert to hv_utils_transport
Vitaly Kuznetsov [Sun, 12 Apr 2015 01:07:52 +0000 (18:07 -0700)]
Drivers: hv: vss: convert to hv_utils_transport

Convert to hv_utils_transport to support both netlink and /dev/vmbus/hv_vss communication methods.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 6472f80a2eeb34b442542bccd4d600e9251d9c36)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoDrivers: hv: vss: switch to using the hvutil_device_state state machine
Vitaly Kuznetsov [Sun, 12 Apr 2015 01:07:48 +0000 (18:07 -0700)]
Drivers: hv: vss: switch to using the hvutil_device_state state machine

Switch to using the hvutil_device_state state machine from using kvp_transaction.active.

State transitions are:
-> HVUTIL_DEVICE_INIT when driver loads or on device release
-> HVUTIL_READY if the handshake was successful
-> HVUTIL_HOSTMSG_RECEIVED when there is a non-negotiation message from the host
-> HVUTIL_USERSPACE_REQ after we sent the message to the userspace daemon
   -> HVUTIL_USERSPACE_RECV after/if the userspace daemon has replied
-> HVUTIL_READY after we respond to the host
-> HVUTIL_DEVICE_DYING on driver unload
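
The transitions listed above can be modelled as a small table-free state
machine. This is a sketch under stated assumptions: the state names
follow the list, while the event names, the step() function and the
in_transaction() helper are hypothetical and only illustrate the flow.

```c
#include <stdbool.h>

enum hvutil_state_sketch {
	ST_DEVICE_INIT,
	ST_READY,
	ST_HOSTMSG_RECEIVED,
	ST_USERSPACE_REQ,
	ST_USERSPACE_RECV,
	ST_DEVICE_DYING
};

/* Hypothetical events that drive the transitions. */
enum hvutil_event_sketch {
	EV_HANDSHAKE_OK,
	EV_HOST_MSG,		/* non-negotiation message from the host */
	EV_SENT_TO_DAEMON,
	EV_DAEMON_REPLIED,
	EV_HOST_ANSWERED,	/* we responded to the host */
	EV_DEVICE_RELEASE,
	EV_DRIVER_UNLOAD
};

static enum hvutil_state_sketch step(enum hvutil_state_sketch s,
				     enum hvutil_event_sketch ev)
{
	if (ev == EV_DRIVER_UNLOAD)
		return ST_DEVICE_DYING;
	if (s == ST_DEVICE_DYING)
		return s;			/* terminal state */
	if (ev == EV_DEVICE_RELEASE)
		return ST_DEVICE_INIT;

	switch (s) {
	case ST_DEVICE_INIT:
		return ev == EV_HANDSHAKE_OK ? ST_READY : s;
	case ST_READY:
		return ev == EV_HOST_MSG ? ST_HOSTMSG_RECEIVED : s;
	case ST_HOSTMSG_RECEIVED:
		return ev == EV_SENT_TO_DAEMON ? ST_USERSPACE_REQ : s;
	case ST_USERSPACE_REQ:
		if (ev == EV_DAEMON_REPLIED)
			return ST_USERSPACE_RECV;
		/* the daemon may never reply; answering the host ends it */
		return ev == EV_HOST_ANSWERED ? ST_READY : s;
	case ST_USERSPACE_RECV:
		return ev == EV_HOST_ANSWERED ? ST_READY : s;
	default:
		return s;
	}
}

/* True while a request is outstanding with the userspace daemon. */
static bool in_transaction(enum hvutil_state_sketch s)
{
	return s == ST_USERSPACE_REQ || s == ST_USERSPACE_RECV;
}
```

A check like in_transaction() is one way to express the protection
against accepting a new registration in the middle of a transaction.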

In hv_vss_onchannelcallback() process ICMSGTYPE_NEGOTIATE messages even when
the userspace daemon is disconnected, otherwise we can make the host think
we don't support VSS and disable the service completely.

Unfortunately there is no good way we can figure out that the userspace daemon
has died (unless we start treating all timeouts as such), add a protection
against processing new VSS_OP_REGISTER messages while being in the middle of a
transaction (HVUTIL_USERSPACE_REQ or HVUTIL_USERSPACE_RECV state).

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 086a6f68d6933d3c48b3898752cd6ca1a0e02aec)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoDrivers: hv: vss: process deferred messages when we complete the transaction
Vitaly Kuznetsov [Sun, 12 Apr 2015 01:07:43 +0000 (18:07 -0700)]
Drivers: hv: vss: process deferred messages when we complete the transaction

In theory, the host is not supposed to issue any requests before we reply to
the previous one. In KVP we, however, support the following scenarios:
1) A message was received before userspace daemon registered;
2) A message was received while the previous one is still being processed.
In VSS we support only the former. Add support for the latter, using
hv_poll_channel() to do the job.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 38c06c29bada78c4805000bfb9b7f19cd691461b)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoDrivers: hv: kvp: convert to hv_utils_transport
Vitaly Kuznetsov [Sun, 12 Apr 2015 01:07:54 +0000 (18:07 -0700)]
Drivers: hv: kvp: convert to hv_utils_transport

Convert to hv_utils_transport to support both netlink and /dev/vmbus/hv_kvp communication methods.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 11bc3a5fa91f193b3d947a4cf51e21c4aa13292d)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
7 years agoRevert "ipv4: use skb coalescing in defragmentation"
Florian Westphal [Fri, 10 Jul 2015 23:37:36 +0000 (01:37 +0200)]
Revert "ipv4: use skb coalescing in defragmentation"

This reverts commit 3cc4949269e01f39443d0fcfffb5bc6b47878d45.

There is nothing wrong with coalescing during defragmentation, it
reduces truesize overhead and simplifies things for the receiving
socket (no fraglist walk needed).

However, it also destroys geometry of the original fragments.
While that doesn't cause any breakage (we make sure to not exceed largest
original size) ip_do_fragment contains a 'fastpath' that takes advantage
of a present frag list and results in fragments that (in most cases)
match what was received.

In case it's needed, the coalescing could be done later, when we're sure
the skb is not forwarded.  But discussion during NFWS resulted in
'let's just remove this for now'.

Orabug: 25819103

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 14fe22e334623e451b5592193415c644005461ea)

Acked-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Conflicts:
net/ipv4/ip_fragment.c

7 years agoxfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder
Andy Whitcroft [Thu, 23 Mar 2017 07:45:44 +0000 (07:45 +0000)]
xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder

Orabug: 25805996
CVE: CVE-2017-7184

Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues.  To make sure that the two ESN structures are the same
size, compare both the overall size as reported by
xfrm_replay_state_esn_len() and the internal length.

Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f843ee6dd019bcece3e74e76ad9df0155655d0df)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agoxfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window
Andy Whitcroft [Wed, 22 Mar 2017 07:29:31 +0000 (07:29 +0000)]
xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window

Orabug: 25805996
CVE: CVE-2017-7184

When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer.  However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call.  There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents.  We do
not at this point check that the replay_window is within the allocated
memory.  This leads to out-of-bounds reads and writes triggered by
netlink packets.  This leads to memory corruption and the potential for
privilege escalation.

We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len().  This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn.  It however does not check the replay_window
remains within that buffer.  Add validation of the contained
replay_window.
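
A minimal sketch of the kind of validation described, assuming a
simplified structure: the update must keep the same bitmap length and
overall size, and its replay_window must fit within the bits the bitmap
provides. The struct layout and function names are hypothetical and are
not the xfrm definitions; 64-bit arithmetic is used so the size
computation cannot wrap.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced replay state: bmp_len 32-bit words of bitmap
 * would follow this header in memory. */
struct esn_sketch {
	uint32_t bmp_len;
	uint32_t replay_window;
};

/* Overall size in bytes, computed in 64 bits to avoid wrapping. */
static uint64_t esn_len(const struct esn_sketch *e)
{
	return (uint64_t)sizeof(*e) + (uint64_t)e->bmp_len * sizeof(uint32_t);
}

/* Validate an update 'up' against the existing state 'cur'. */
static int verify_update(const struct esn_sketch *cur,
			 const struct esn_sketch *up)
{
	/* internal length and overall size must both match */
	if (up->bmp_len != cur->bmp_len)
		return -EINVAL;
	if (esn_len(up) != esn_len(cur))
		return -EINVAL;

	/* the window must not exceed the number of bits in the bitmap */
	if (up->replay_window > (uint64_t)up->bmp_len * sizeof(uint32_t) * 8)
		return -EINVAL;

	return 0;
}
```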

Signed-off-by: Andy Whitcroft <apw@canonical.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 677e806da4d916052585301785d847c3b3e6186a)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years agolpfc cannot establish connection with targets that send PRLI under P2P mode
Joe Jin [Wed, 22 Mar 2017 00:00:39 +0000 (08:00 +0800)]
lpfc cannot establish connection with targets that send PRLI under P2P mode

Orabug: 25802913

If lpfc rejects a PRLI that is sent from a target, the target will not
resend it and will reject the PRLI sent from the initiator.

Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Joe Jin <joe.jin@oracle.com>
7 years agotty: n_hdlc: get rid of racy n_hdlc.tbuf
Alexander Popov [Tue, 28 Feb 2017 16:54:40 +0000 (19:54 +0300)]
tty: n_hdlc: get rid of racy n_hdlc.tbuf

Currently N_HDLC line discipline uses a self-made singly linked list for
data buffers and has n_hdlc.tbuf pointer for buffer retransmitting after
an error.

The commit be10eb7589337e5defbe214dae038a53dd21add8
("tty: n_hdlc add buffer flushing") introduced racy access to n_hdlc.tbuf.
After tx error concurrent flush_tx_queue() and n_hdlc_send_frames() can put
one data buffer to tx_free_buf_list twice. That causes double free in
n_hdlc_release().

Let's use standard kernel linked list and get rid of n_hdlc.tbuf:
in case of tx error put current data buffer after the head of tx_buf_list.

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25802678
CVE: CVE-2017-2636
(cherry picked from commit 82f2341c94d270421f383641b7cd670e474db56b)
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
7 years agoTTY: n_hdlc, fix lockdep false positive
Jiri Slaby [Thu, 26 Nov 2015 18:28:26 +0000 (19:28 +0100)]
TTY: n_hdlc, fix lockdep false positive

The class of 4 n_hdlc buf locks is the same because a single function
n_hdlc_buf_list_init is used to init all the locks. But since
flush_tx_queue takes n_hdlc->tx_buf_list.spinlock and then calls
n_hdlc_buf_put which takes n_hdlc->tx_free_buf_list.spinlock, lockdep
emits a warning:
=============================================
[ INFO: possible recursive locking detected ]
4.3.0-25.g91e30a7-default #1 Not tainted
---------------------------------------------
a.out/1248 is trying to acquire lock:
 (&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fd020>] n_hdlc_buf_put+0x20/0x60 [n_hdlc]

but task is already holding lock:
 (&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fdc07>] n_hdlc_tty_ioctl+0x127/0x1d0 [n_hdlc]

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&list->spinlock)->rlock);
  lock(&(&list->spinlock)->rlock);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by a.out/1248:
 #0:  (&tty->ldisc_sem){++++++}, at: [<ffffffff814c9eb0>] tty_ldisc_ref_wait+0x20/0x50
 #1:  (&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fdc07>] n_hdlc_tty_ioctl+0x127/0x1d0 [n_hdlc]
...
Call Trace:
...
 [<ffffffff81738fd0>] _raw_spin_lock_irqsave+0x50/0x70
 [<ffffffffa01fd020>] n_hdlc_buf_put+0x20/0x60 [n_hdlc]
 [<ffffffffa01fdc24>] n_hdlc_tty_ioctl+0x144/0x1d0 [n_hdlc]
 [<ffffffff814c25c1>] tty_ioctl+0x3f1/0xe40
...

Fix it by initializing the spin_locks separately. This also removes a
redundant memset of a freshly kzallocated space.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25802678
CVE: CVE-2017-2636
(cherry picked from commit e9b736d88af1a143530565929390cadf036dc799)
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
7 years agonet/llc: avoid BUG_ON() in skb_orphan()
Eric Dumazet [Sun, 12 Feb 2017 22:03:52 +0000 (14:03 -0800)]
net/llc: avoid BUG_ON() in skb_orphan()

It seems nobody used LLC since linux-3.12.

Fortunately fuzzers like syzkaller still know how to run this code,
otherwise it would be no fun.

Setting skb->sk without skb->destructor leads to all kinds of
bugs, we now prefer to be very strict about it.

Ideally here we would use skb_set_owner() but this helper does not exist yet,
only CAN seems to have a private helper for that.

Orabug: 25802599
CVE: CVE-2017-6345

Fixes: 376c7311bdb6 ("net: add a temporary sanity check in skb_orphan()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com>
(cherry picked from commit 8b74d439e1697110c5e5c600643e823eb1dd0762)
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years ago ip: fix IP_CHECKSUM handling
Paolo Abeni [Tue, 21 Feb 2017 08:33:18 +0000 (09:33 +0100)]
ip: fix IP_CHECKSUM handling

The skbs processed by ip_cmsg_recv() are not guaranteed to
be linear, e.g. when sending UDP packets over loopback with
MSG_MORE.
Using csum_partial() on [potentially] the whole skb len
is dangerous; instead be on the safe side and use skb_checksum().
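
To illustrate why summing only one contiguous buffer is unsafe, here is a minimal user-space Python model (not kernel code). The names `csum_linear` and `csum_fragments` are hypothetical: the first stands in for a checksum that sees a single linear area, the second for one that walks every fragment, which is the behaviour the message attributes to skb_checksum().

```python
def fold16(total):
    """Fold a sum into 16 bits with end-around carry (ones' complement)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum_linear(buf):
    """Ones' complement sum of one contiguous buffer."""
    total = 0
    for offset, byte in enumerate(buf):
        # Bytes at even offsets are the high half of a 16-bit word.
        total += byte << 8 if offset % 2 == 0 else byte
    return fold16(total)


def csum_fragments(frags):
    """Same sum, but walking every fragment of a non-linear buffer."""
    total = 0
    offset = 0
    for frag in frags:
        for byte in frag:
            total += byte << 8 if offset % 2 == 0 else byte
            offset += 1
    return fold16(total)


payload = bytes(range(1, 12))                      # 11 bytes of sample data
frags = [payload[:3], payload[3:8], payload[8:]]   # odd-sized pieces
```

Walking all fragments gives the same result as summing the whole payload, while summing only the first fragment (as a linear-only routine would on a fragmented buffer) does not.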

Thanks to the syzkaller team for detecting the issue and providing the
reproducer.

v1 -> v2:
 - move the variable declaration in a tighter scope

Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ca4ef4574f1ee5252e2cd365f8f5d5bafd048f32)

Orabug: 25802576
CVE-2017-6347

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years ago udp: fix IP_CHECKSUM handling
Eric Dumazet [Mon, 24 Oct 2016 01:03:06 +0000 (18:03 -0700)]
udp: fix IP_CHECKSUM handling

First bug was added in commit ad6f939ab193 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));

Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now that UDP headers are pulled
before skbs are put in the receive queue.

Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv")
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 10df8e6152c6c400a563a673e9956320bfce1871)

Orabug: 25802576
CVE-2017-6347

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years ago udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM
Willem de Bruijn [Thu, 7 Apr 2016 22:12:59 +0000 (18:12 -0400)]
udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM

On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6dd pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.

Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checksum when reading truncated data.

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 31c2e4926fe912f88388bcaa8450fcaa8f2ece47)

Orabug: 25802576
CVE-2017-6347

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years ago tcp: avoid infinite loop in tcp_splice_read()
Eric Dumazet [Fri, 3 Feb 2017 22:59:38 +0000 (14:59 -0800)]
tcp: avoid infinite loop in tcp_splice_read()

Splicing from TCP socket is vulnerable when a packet with URG flag is
received and stored into receive queue.

__tcp_splice_read() returns 0, and sk_wait_data() immediately
returns since there is the problematic skb in queue.

This is a nice way to burn cpu (aka infinite loop) and trigger
soft lockups.

Again, this gem was found by syzkaller tool.

Orabug: 25802549
CVE: CVE-2017-6214

Fixes: 9c55e01c0cc8 ("[TCP]: Splice receive support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Aniket Alshi <aniket.alsi@oracle.com>
(cherry picked from commit ccf7abb93af09ad0868ae9033d1ca8108bdaec82)
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years ago sctp: avoid BUG_ON on sctp_wait_for_sndbuf
Marcelo Ricardo Leitner [Mon, 6 Feb 2017 20:10:31 +0000 (18:10 -0200)]
sctp: avoid BUG_ON on sctp_wait_for_sndbuf

Alexander Popov reported that an application may trigger a BUG_ON in
sctp_wait_for_sndbuf if the socket tx buffer is full, a thread is
waiting on it to queue more data and meanwhile another thread peels off
the association being used by the first thread.

This patch replaces the BUG_ON call with proper error handling. It
will return -EPIPE to the original sendmsg call, similar to what would
have been done if the association wasn't found in the first place.

Acked-by: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2dcab598484185dea7ec22219c76dcdd59e3cb90)

Orabug: 25802515
CVE-2017-5986

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years ago ext4: store checksum seed in superblock
Darrick J. Wong [Sat, 17 Oct 2015 20:16:02 +0000 (16:16 -0400)]
ext4: store checksum seed in superblock

Allow the filesystem to store the metadata checksum seed in the
superblock and add an incompat feature to say that we're using it.
This enables tune2fs to change the UUID on a mounted metadata_csum
FS without having to (racy!) rewrite all disk metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 8c81bd8f586c46eaf114758a78d82895a2b081c2)

Orabug: 25802481
CVE-2016-10208

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
 Conflicts:
fs/ext4/sysfs.c

7 years ago ext4: reserve code points for the project quota feature
Theodore Ts'o [Sat, 17 Oct 2015 20:15:18 +0000 (16:15 -0400)]
ext4: reserve code points for the project quota feature

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 8b4953e13f4c5d9a3c869f5fca7d51e1700e7db0)

Orabug: 25802481
CVE-2016-10208

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years ago ext4: validate s_first_meta_bg at mount time
Eryu Guan [Thu, 1 Dec 2016 20:08:37 +0000 (15:08 -0500)]
ext4: validate s_first_meta_bg at mount time

Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. It turns out that the kernel crashed when
calculating fs overhead (ext4_calculate_overhead()) because
the image has a very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the memory in count_overhead() when
setting bits in the bitmap buffer, which is only PAGE_SIZE.

ext4_calculate_overhead():
  buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
  blks = count_overhead(sb, i, buf);

count_overhead():
  for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
          ext4_set_bit(EXT4_B2C(sbi, s++), buf);   <=== buffer overrun
          count++;
  }

This can be reproduced easily for me by this script:

  #!/bin/bash
  rm -f fs.img
  mkdir -p /mnt/ext4
  fallocate -l 16M fs.img
  mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
  debugfs -w -R "ssv first_meta_bg 842150400" fs.img
  mount -o loop fs.img /mnt/ext4

Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.
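
The mount-time check described above can be sketched as a small user-space Python model. This is only an illustration under stated assumptions: `max_meta_bg` and `first_meta_bg_is_valid` are hypothetical names, the bound is taken to be the number of group-descriptor blocks, and the exact boundary comparison in the kernel may differ.

```python
def max_meta_bg(groups_count, desc_per_block):
    """Assumed bound: descriptor blocks needed to cover all block groups."""
    return (groups_count + desc_per_block - 1) // desc_per_block


def first_meta_bg_is_valid(first_meta_bg, groups_count, desc_per_block):
    """Refuse a value that exceeds the largest possible meta_bg number."""
    return first_meta_bg <= max_meta_bg(groups_count, desc_per_block)


# Hypothetical geometry for a tiny image: 2 block groups, 32 descriptors
# per descriptor block. The corrupted value from the report is far out of range.
REPORTED_VALUE = 842150400
```

With such a check in place, the loop in count_overhead() is never reached with an out-of-range count, because the mount is refused first.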

Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
(cherry picked from commit 3a4b77cd47bb837b8557595ec7425f281f2ca1fe)

Orabug: 25802481
CVE-2016-10208

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
7 years ago ext4: clean up feature test macros with predicate functions
Darrick J. Wong [Sat, 17 Oct 2015 20:18:43 +0000 (16:18 -0400)]
ext4: clean up feature test macros with predicate functions

Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros.  Furthermore, clean out the
places where we open-coded feature tests.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
(cherry picked from commit e2b911c53584a92266943f3b7f2cdbc19c1a4e80)

Orabug: 25802481
CVE-2016-10208

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
 Conflicts:
fs/ext4/namei.c
fs/ext4/super.c

7 years ago KVM: x86: fix emulation of "MOV SS, null selector"
Paolo Bonzini [Thu, 12 Jan 2017 14:02:32 +0000 (15:02 +0100)]
KVM: x86: fix emulation of "MOV SS, null selector"

This is CVE-2017-2583.  On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.

The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.
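
As a rough illustration of the selector rule, the following Python sketch decodes the RPL bits and applies the "null SS only if RPL < 3" condition. It is a model of the rule as stated in this message, not the emulator code; the function names are hypothetical.

```python
RPL_MASK = 0x3  # the low two bits of a segment selector hold the RPL


def selector_rpl(selector):
    """Requested privilege level encoded in the selector."""
    return selector & RPL_MASK


def is_null_selector(selector):
    # Selectors 0..3 all refer to GDT entry 0 and differ only in RPL.
    return (selector & ~RPL_MASK) == 0


def null_ss_load_allowed(selector):
    """A null SS is acceptable only when its RPL is below 3."""
    return is_null_selector(selector) and selector_rpl(selector) < 3
```

For example, selector 0 passes the check, while selector 3 (null, but RPL 3) is rejected.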

Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.

Orabug: 25802278
CVE: CVE-2017-2583

Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011
Cc: stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 33ab91103b3415e12457e3104f0e4517ce12d0f3)

Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
7 years ago gfs2: fix slab corruption during mounting and umounting gfs file system
Thomas Tai [Wed, 22 Mar 2017 17:52:11 +0000 (10:52 -0700)]
gfs2: fix slab corruption during mounting and umounting gfs file system

While mounting and unmounting a GFS2 file system, a kernel panic occurs
due to slab memory corruption. The slab allocator reports that it is
likely a double free. The issue is traced back to
v3.9-rc6, where a patch was submitted to use kzalloc() for storing a
bitmap instead of using a local variable. The intention was to allocate
the memory during mounting and to free it during unmounting. The original
patch missed a code path that has already freed the memory, which caused
the memory corruption. This patch sets the memory pointer to NULL after
the memory is freed, so that the double free can no longer
happen.

gdlm_mount()
  '-- set_recover_size() which uses kzalloc()
  '-- if dlm does not support ops callbacks then
          '--- free_recover_size() which uses kfree()

gdlm_unmount()
  '-- free_recover_size() which uses kfree()

The previous patch that introduced the double free is
commit 57c7310b8eb9 ("GFS2: use kmalloc for lvb bitmap")
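
The call sequence above can be modelled in user-space Python to show why clearing the pointer helps. Everything here is a toy stand-in: `ToyAllocator` and `LockStruct` are hypothetical, and the only property borrowed from the kernel is that freeing a NULL pointer is a no-op.

```python
class ToyAllocator:
    """Tiny stand-in for kzalloc()/kfree() that detects double frees."""

    def __init__(self):
        self.live = set()
        self.next_id = 1

    def alloc(self):
        handle = self.next_id
        self.next_id += 1
        self.live.add(handle)
        return handle

    def free(self, handle):
        if handle is None:          # like kfree(NULL): harmless no-op
            return
        if handle not in self.live:
            raise RuntimeError("double free")
        self.live.remove(handle)


class LockStruct:
    """Holds the bitmap pointer; None means nothing is allocated."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.bitmap = None

    def set_recover_size(self):
        self.bitmap = self.allocator.alloc()

    def free_recover_size(self):
        self.allocator.free(self.bitmap)
        self.bitmap = None          # the fix: forget the pointer once freed


def mount_then_unmount(ls, dlm_has_ops_callbacks):
    ls.set_recover_size()
    if not dlm_has_ops_callbacks:
        ls.free_recover_size()      # early free on the mount path
    ls.free_recover_size()          # unmount path frees again
```

Without the `self.bitmap = None` line, the second call would hand the stale handle back to the allocator and the model would raise "double free".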

orabug: 25253085
orabug: 25791662

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
7 years ago gfs2: handle NULL rgd in set_rgrp_preferences
Abhi Das [Tue, 5 May 2015 16:26:04 +0000 (11:26 -0500)]
gfs2: handle NULL rgd in set_rgrp_preferences

The function set_rgrp_preferences() does not handle the (rarely
returned) NULL value from gfs2_rgrpd_get_next() and this patch
fixes that.

The fs image in question is only 150MB in size which allows for
only 1 rgrp to be created. The in-memory rb tree has only 1 node
and when gfs2_rgrpd_get_next() is called on this sole rgrp, it
returns NULL. (Default behavior is to wrap around the rb tree and
return the first node to give the illusion of a circular linked
list. In the case of only 1 rgrp, we can't have
gfs2_rgrpd_get_next() return the same rgrp (first, last, next all
point to the same rgrp)... that would cause unintended consequences
and infinite loops.)

orabug: 25253085
Orabug: 25791662

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
(cherry picked from upstream commit 959b6717175713259664950f3bba2418b038f69a)
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
7 years ago Revert "fix minor infoleak in get_user_ex()"
Brian Maly [Thu, 30 Mar 2017 20:42:17 +0000 (16:42 -0400)]
Revert "fix minor infoleak in get_user_ex()"

Orabug: 25790370
CVE: CVE-2016-9644

This reverts commit fc2cba1c03dc9da0668d1719a6ad47b39c26b574.

7 years ago sched/wait: Fix signal handling in bit wait helpers
Peter Zijlstra [Tue, 1 Dec 2015 13:04:04 +0000 (14:04 +0100)]
sched/wait: Fix signal handling in bit wait helpers

Vladimir reported getting RCU stall warnings and bisected it back to
commit:

  743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")

That commit inadvertently reversed the calls to schedule() and signal_pending(),
thereby not handling the case where the signal is received while we sleep.
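
The ordering problem can be shown with a small Python model. This is a sketch, not the kernel helpers: `Task`, `bit_wait_reversed` and `bit_wait_fixed` are hypothetical names, and "sleeping" is simulated by a method that may set a pending signal.

```python
EINTR = 4


class Task:
    def __init__(self, signal_arrives_during_sleep):
        self.signal_pending = False
        self._signal_during_sleep = signal_arrives_during_sleep

    def schedule(self):
        # While the task sleeps, a signal may be delivered and wake it up.
        if self._signal_during_sleep:
            self.signal_pending = True


def bit_wait_reversed(task):
    """Broken order: test for a signal first, then sleep."""
    if task.signal_pending:
        return -EINTR
    task.schedule()
    return 0


def bit_wait_fixed(task):
    """Sleep first, then report a signal that arrived in the meantime."""
    task.schedule()
    if task.signal_pending:
        return -EINTR
    return 0
```

In the reversed order the helper returns 0 even though a signal arrived during the sleep, so the caller goes around its wait loop again instead of bailing out.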

Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mark.rutland@arm.com
Cc: neilb@suse.de
Cc: oleg@redhat.com
Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")
Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 68985633bccb6066bf1803e316fbc6c1f5b796d6)

Orabug: 25416990
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-By: Dan Duval <dan.duval@oracle.com>
 Conflicts:
kernel/sched/wait.c

7 years ago xen-pcifront/hvm: Slurp up "pxm" entry and set NUMA node on PCIe device. (V5)
Konrad Rzeszutek Wilk [Sat, 11 Mar 2017 01:24:34 +0000 (20:24 -0500)]
xen-pcifront/hvm: Slurp up "pxm" entry and set NUMA node on PCIe device. (V5)

If XenBus contains the "pci" node (which by default
it does for both PV and HVM guests), then iterate over
all the entries there and see if any have a "pxm-X"
key. If so, those values are used to modify the NUMA locality
information for the matching PCIe devices.

Also support PCIe hotplug, in case this is done during runtime.

This patch also depends on Xen exposing the
"pxm-%d" entries via XenBus.

A bit of background:

_PXM in ACPI is used to tie all kinds of ACPI devices to the SRAT
table.

The SRAT table is a simple N-CPU array that lists APIC IDs and the NUMA nodes
and their distance from each other. There are two types - processor
affinity and memory affinity. For example, a 4-CPU
machine can have this processor affinity:

APIC_ID | NUMA id (nid)
--------+--------------
0       | 0
2       | 0
4       | 1
6       | 1

The _PXM ties in the NUMA node (nid), so for this guest there can only be
two values - 0 or 1.

The _PXM can be slapped on almost anything in the DSDT: the Processors
(kind of redundant, as it is in the SRAT), but most importantly for us, the
PCIe devices. Except that ACPI does not enumerate all kinds of PCIe devices.

 Device (PCI0)
        {
            Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */)  // _HID: Hardware ID
..
            Name (_PXM, Zero)  // _PXM: Device Proximity
        }

  Device (PCI1)
        {
            Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */)  // _HID: Hardware ID
            Name (_CID, EisaId ("PNP0A03") /* PCI Bus */)  // _CID: Compatible ID
            Name (_PXM, 0x01)  // _PXM: Device Proximity
        }

And this nicely helps with the Linux OS (and Windows) enumerating the
PCIe bridges (the two above) _ONCE_ during bootup. Then when a device
is hotplugged under the bridges it is very clear to which NUMA domain
it belongs.

To recap, on normal hardware Linux scans _ONCE_ the DSDT during
bootup and _only_ evaluates the _PXM on bridges "PNP0A03".

ONCE.

On the QEMU guests that Xen provides we have exactly _one_ bridge.
And the PCIe devices are hotplugged as 'slots' under it.

The SR-IOV VFs we hot-plug in the guest are done during runtime (not
during bootup, that would be too easy).

This means to make this work we would need to implement in QEMU:
 1) Expand piix4 emulation to have bridges, PCIe bridges at bootup.
    And bridges also expose the "window" of what the size of the MMIO
    region is behind it (and the PCIe devices would fit in there).

 2) Create up to as many of these PCI bridges as there are NUMA nodes,
    with the _PXM information.

 3) Then during PCI hotplug, decide which bridge to use based on the
    NUMA locality.

That is hard. Item 1) is especially difficult as we have no idea
how big the MMIO BAR of the plugged-in device will be!

Fortunately Intel resolved this with Intel VT-d. It has a hotplug
capability so you can insert a brand new PCIe bridge at any point.
This is how Thunderbolt works in essence.

This would mean that in QEMU we would need to:
 4) Emulate in QEMU an IOMMU VT-d with PCI hotplug support.

Recognizing that 1-4 may take some time, and would need to be
done upstream first, I decided to take a bit of a shortcut.

Mainly that:
 1) This only needs to work for ExaData which uses our kernel (UEK)
 2) We already expose some of this information on XenBus.
 3) Once upstream is done this can be easily 'dropped'.

In fact, 2) exposes a lot of information:

/libxl/1/device/pci/0 = ""
/libxl/1/device/pci/0/frontend = "/local/domain/1/device/pci/0"
/libxl/1/device/pci/0/backend = "/local/domain/0/backend/pci/1/0"
.. snip..
/libxl/1/device/pci/0/frontend-id = "1"
/libxl/1/device/pci/0/online = "1"
/libxl/1/device/pci/0/state = "1"
/libxl/1/device/pci/0/key-0 = "0000:04:00.0"
/libxl/1/device/pci/0/dev-0 = "0000:04:00.0"
/libxl/1/device/pci/0/vdevfn-0 = "28"
/libxl/1/device/pci/0/opts-0 = "msitranslate=0,power_mgmt=0,permissive=0"
/libxl/1/device/pci/0/state-0 = "1"
/libxl/1/device/pci/0/key-1 = "0000:07:00.0"
/libxl/1/device/pci/0/dev-1 = "0000:07:00.0"
/libxl/1/device/pci/0/vdevfn-1 = "30"
/libxl/1/device/pci/0/opts-1 = "msitranslate=0,power_mgmt=0,permissive=0"
/libxl/1/device/pci/0/state-1 = "1"
/libxl/1/device/pci/0/num_devs = "2"

The 'vdevfn' is the slot:function value. 28 (read as hex) is 00:05.0 and 30
is 00:06.0, and that corresponds to (inside of the guest):
-bash-4.1# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Class ff80: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 USB Controller: NEC Corporation Device 0194 (rev 03)
00:06.0 USB Controller: NEC Corporation Device 0194 (rev 04)

This 'vdevfn' is created by QEMU when the device is hotplugged
(or at bootup time).
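
The mapping from a 'vdevfn' string to the slot and function shown by lspci can be sketched in Python. Reading the string as hexadecimal is an assumption, inferred because only then do 28 and 30 decode to slots 5 and 6; the helper names are hypothetical. The split itself is the usual PCI devfn layout, with the function in the low 3 bits and the slot above them.

```python
def decode_vdevfn(text):
    """Split a vdevfn string (read as hex) into (slot, function)."""
    devfn = int(text, 16)
    return devfn >> 3, devfn & 0x7


def format_bdf(text, bus=0):
    """Render the vdevfn the way lspci prints it, e.g. '00:05.0'."""
    slot, func = decode_vdevfn(text)
    return "%02x:%02x.%x" % (bus, slot, func)
```

For the two devices above, `format_bdf("28")` gives "00:05.0" and `format_bdf("30")` gives "00:06.0".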

So I figured that if we have an extra key:

/libxl/1/device/pci/0/pxm-0 = "1"
/libxl/1/device/pci/0/pxm-1 = "1"

we could use that inside of the guest to call 'set_dev_node'
on the PCIe devices.
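
A minimal Python sketch of that lookup, assuming the keys arrive as a flat dictionary of strings. The function name `pxm_by_vdevfn` and the `online_nodes` parameter are hypothetical; the latter models the node_online() check mentioned in the changelog further down, so a pxm value larger than the number of NUMA nodes is ignored.

```python
def pxm_by_vdevfn(entries, online_nodes):
    """Map each vdevfn to its pxm value, skipping nodes that are not online."""
    result = {}
    for key, value in entries.items():
        if not key.startswith("pxm-"):
            continue
        index = key[len("pxm-"):]
        vdevfn = entries.get("vdevfn-" + index)
        node = int(value)
        if vdevfn is None or node not in online_nodes:
            continue
        result[vdevfn] = node
    return result


# Keys modelled on the listing above (path prefix omitted).
entries = {
    "vdevfn-0": "28", "pxm-0": "1",
    "vdevfn-1": "30", "pxm-1": "1",
    "num_devs": "2",
}
```

With nodes 0 and 1 online, both devices map to node 1; if only node 0 is online, the result is empty and the devices keep their default locality.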

And in fact, for the lspci output above we can make this work:

-bash-4.1# find /sys -name local_cpulist
/sys/devices/pci0000:00/0000:00:00.0/local_cpulist
/sys/devices/pci0000:00/0000:00:01.3/local_cpulist
/sys/devices/pci0000:00/0000:00:03.0/local_cpulist
/sys/devices/pci0000:00/0000:00:01.1/local_cpulist
/sys/devices/pci0000:00/0000:00:06.0/local_cpulist <===
/sys/devices/pci0000:00/0000:00:02.0/local_cpulist
/sys/devices/pci0000:00/0000:00:05.0/local_cpulist <===
/sys/devices/pci0000:00/0000:00:01.2/local_cpulist
/sys/devices/pci0000:00/0000:00:01.0/local_cpulist
-bash-4.1# find /sys -name local_cpulist | xargs cat
0-3
0-3
0-3
0-3
2-3 <===
0-3
2-3 <===
0-3
0-3

-bash-4.1# find /sys -name cpulist
/sys/devices/system/node/node0/cpulist
/sys/devices/system/node/node1/cpulist
-bash-4.1# find /sys -name cpulist | xargs cat
0-1
2-3

With this guest config:
kernel = "hvmloader"
device_model_version = 'qemu-xen-traditional'
builder='hvm'
memory=1024
serial='pty'
smt=1
vcpus = 4
cpus=['0-3']
name="bootstrap-x86_64-pvhvm"
disk = [ 'file:/mnt/lab/bootstrap-x86_64/root_image.iso,hdc:cdrom,r','phy:/dev/guests/bootstrap-x86_64-pvhvm,xvda,w']
boot="dn"
vif = [ 'mac=00:0F:4B:00:00:68, bridge=switch' ]
vnc=1
vnclisten="0.0.0.0"
usb=1
usbdevice="tablet"
xen_platform_pci=1
vnuma=[ ["pnode=0", "size=512", "vcpus=0,1","vdistances=10,20"],
        ["pnode=0", "size=512", "vcpus=2,3","vdistances=20,10"]]

pci=['04:00.0','07:00.0']

And during bootup you would see:
[   11.227821] calling  pcifront_init+0x0/0x118 @ 1
[   11.241826] pcifront_hvm pci-0: PCI device 0000:00:05.0 (PXM=1)
[   11.261298] pci 0000:00:05.0: Updating PXM to 1
[   11.273989] initcall pcifront_init+0x0/0x118 returned 0 after 32791 usecs
[   11.274620] pcifront_hvm pci-0: PCI device 0000:00:05.0 (PXM=1)
[   11.276977] pcifront_hvm pci-0: PCI device 0000:00:05.0 (PXM=1)

OraBug: 25788744

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
v1: Fixed all the checkpatch.pl issues
       Made it respect the node_online() in case the backend provided
       a larger value than there are NUMA nodes.
v2: Fixed per Boris's reviews.
v3: Added a mechanism to prune the list when devices are removed.
v4: s/l/len
    Added space after 'len' in declaration.
    Fixed comments
    Added Reviewed-by.
v5: Added Boris's Reviewed-by

7 years ago IB/CORE: sync the resouce access in fmr_pool
Wengang Wang [Fri, 10 Mar 2017 17:21:01 +0000 (09:21 -0800)]
IB/CORE: sync the resouce access in fmr_pool

orabug: 25677461

There were some problems in the fmr_pool code: it was either missing lock
protection or using the wrong lock when allocating, freeing, or looking up
resources in the FMR pool.

After covering all of the above issues, it turns out that everywhere we need lock
protection we need both the pool_lock and used_pool_lock. So this patch also
removes the used_pool_lock, keeps the pool_lock, and makes the latter synchronize
all the accesses.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>
7 years ago net: ping: check minimum size on ICMP header length
Kees Cook [Mon, 5 Dec 2016 18:34:38 +0000 (10:34 -0800)]
net: ping: check minimum size on ICMP header length

Orabug: 25766884
CVE: CVE-2016-8399

Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there
was no check that the iovec contained enough bytes for an ICMP header,
and the read loop would walk across neighboring stack contents. Since the
iov_iter conversion, bad arguments are noticed, but the returned error is
EFAULT. Returning EINVAL is a clearer error and also solves the problem
prior to v3.19.
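
The length check can be sketched in Python as follows. This is a user-space model under stated assumptions: `ping_check_len` is a hypothetical name, and the 8-byte header size is the size of the fixed ICMP echo header.

```python
EINVAL = 22
ICMP_HEADER_LEN = 8  # assumed size of the fixed ICMP echo header, in bytes


def ping_check_len(length, header_len=ICMP_HEADER_LEN):
    """Return 0 if the message can hold a full header, else -EINVAL."""
    if length < header_len:
        return -EINVAL  # too short to contain a full ICMP header
    return 0
```

Rejecting the short message before copying anything means the copy never has to read past the end of what the caller supplied.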

This was found using trinity with KASAN on v3.18:

BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0
Read of size 8 by task trinity-c2/9623
page:ffffffbe034b9a08 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G    BU         3.18.0-dirty #15
Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
Call trace:
[<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90
[<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171
[<     inline     >] __dump_stack lib/dump_stack.c:15
[<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50
[<     inline     >] print_address_description mm/kasan/report.c:147
[<     inline     >] kasan_report_error mm/kasan/report.c:236
[<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259
[<     inline     >] check_memory_region mm/kasan/kasan.c:264
[<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507
[<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15
[<     inline     >] memcpy_from_msg include/linux/skbuff.h:2667
[<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674
[<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714
[<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749
[<     inline     >] __sock_sendmsg_nosec net/socket.c:624
[<     inline     >] __sock_sendmsg net/socket.c:632
[<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643
[<     inline     >] SYSC_sendto net/socket.c:1797
[<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761

CVE-2016-8399

Reported-by: Qidan He <i@flanker017.me>
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0eab121ef8750a5c8637d51534d5e9143fb0633f)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years ago scsi: sg: check length passed to SG_NEXT_CMD_LEN
peter chang [Wed, 15 Feb 2017 22:11:54 +0000 (14:11 -0800)]
scsi: sg: check length passed to SG_NEXT_CMD_LEN

Orabug: 25751395
CVE: CVE-2017-7187

The user can control the size of the next command passed along, but the
value passed to the ioctl isn't checked against the usable max command
size.
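
A minimal Python sketch of the missing bounds check. The limit of 252 and the -ENOMEM return value are assumptions (check drivers/scsi/sg.c for the actual values), and `set_next_cmd_len` with its `state` dictionary is a hypothetical stand-in for the ioctl handler.

```python
ENOMEM = 12
SG_MAX_CDB_SIZE = 252  # assumed driver limit, not taken from this message


def set_next_cmd_len(state, value):
    """Store the next command length only if it fits the usable maximum."""
    if value > SG_MAX_CDB_SIZE:
        return -ENOMEM  # assumed errno; the point is that the value is refused
    state["next_cmd_len"] = value
    return 0
```

An oversized request leaves the stored length untouched, so a later write cannot be told to copy more command bytes than the buffer can hold.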

Cc: <stable@vger.kernel.org>
Signed-off-by: Peter Chang <dpf@google.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years ago xen-netfront: Rework the fix for Rx stall during OOM and network stress
Dongli Zhang [Tue, 21 Mar 2017 23:24:56 +0000 (07:24 +0800)]
xen-netfront: Rework the fix for Rx stall during OOM and network stress

Orabug: 25747721

Commit 90c311b0eeea ("xen-netfront: Fix Rx stall during network
stress and OOM") caused the refill timer to be triggered on almost
all invocations of xennet_alloc_rx_buffers for certain workloads.
This reworks the fix by reverting to the old behaviour and taking into
consideration the skb allocation failure. Refill timer is now triggered
on insufficient requests or skb allocation failure.

Signed-off-by: Vineeth Remanan Pillai <vineethp@amazon.com>
Fixes: 90c311b0eeea (xen-netfront: Fix Rx stall during network stress and OOM)
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport from upstream 538d92912d3190a1dd809233a0d57277459f37b2

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Acked-By: Joe Jin <joe.jin@oracle.com>
7 years ago xen-netfront: Fix Rx stall during network stress and OOM
Dongli Zhang [Tue, 21 Mar 2017 23:24:08 +0000 (07:24 +0800)]
xen-netfront: Fix Rx stall during network stress and OOM

Orabug: 25747721

During an OOM scenario, request slots could not be created as skb
allocation fails. So the netback cannot pass in packets and netfront
wrongly assumes that there is no more work to be done and it disables
polling. This causes Rx to stall.

The issue is with the retry logic which schedules the timer if the
created slots are less than NET_RX_SLOTS_MIN. The count of new request
slots to be pushed is calculated as the difference between the new req_prod
and rsp_cons, which could be more than the actual slots if there are
unconsumed responses.

The fix is to calculate the count of newly created slots as the
difference between new req_prod and old req_prod.
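
The difference between the two calculations is easy to see with numbers. The Python sketch below is only a model: the function names are hypothetical, the indices are treated as free-running 32-bit counters, and the threshold of 19 is a made-up stand-in for NET_RX_SLOTS_MIN.

```python
RING_MASK = 0xFFFFFFFF  # indices modelled as free-running 32-bit counters


def slots_old_calc(new_req_prod, rsp_cons):
    """Old count: everything between rsp_cons and the new req_prod."""
    return (new_req_prod - rsp_cons) & RING_MASK


def slots_new_calc(new_req_prod, old_req_prod):
    """Fixed count: only the slots created by this refill."""
    return (new_req_prod - old_req_prod) & RING_MASK


def needs_refill_timer(created, slots_min):
    return created < slots_min


# Hypothetical state: rsp_cons lags 30 behind the old req_prod, and under
# memory pressure only 2 new slots could be created.
old_req_prod, rsp_cons, new_req_prod = 100, 70, 102
SLOTS_MIN = 19  # illustrative threshold
```

Here the old calculation reports 32 slots and skips the retry timer, while the fixed calculation reports 2 and schedules it.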

Signed-off-by: Vineeth Remanan Pillai <vineethp@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport from upstream 90c311b0eeead647b708a723dbdde1eda3dcad05

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Acked-By: Joe Jin <joe.jin@oracle.com>
7 years ago ipc/shm: Fix shmat mmap nil-page protection
Davidlohr Bueso [Mon, 27 Feb 2017 22:28:24 +0000 (14:28 -0800)]
ipc/shm: Fix shmat mmap nil-page protection

The issue is described here, with a nice testcase:

    https://bugzilla.kernel.org/show_bug.cgi?id=192931

The problem is that shmat() calls do_mmap_pgoff() with MAP_FIXED, and
the address rounded down to 0.  For the regular mmap case, the
protection mentioned above is that the kernel gets to generate the
address -- arch_get_unmapped_area() will always check for MAP_FIXED and
return that address.  So by the time we do security_mmap_addr(0) things
get funky for shmat().

The testcase itself shows that while a regular user crashes, root will
not have a problem attaching a nil-page.  There are two possible fixes
to this.  The first, which this patch does, is to simply allow root
to crash as well -- this is also regular mmap behavior, i.e. when hacking
up the testcase and adding mmap(...  |MAP_FIXED).  While this approach
is the safer option, the second alternative is to ignore SHM_RND if the
rounded address is 0, thus only having MAP_SHARED flags.  This makes the
behavior of shmat() identical to the mmap() case.  The downside of this
is obviously user visible, but does make sense in that it maintains
semantics after the round-down wrt 0 address and mmap.
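
The round-down that turns a small non-zero address into the nil page is plain mask arithmetic. The Python sketch below shows only that arithmetic; the function names are hypothetical and 4096 is used as an example SHMLBA value.

```python
def shm_round_down(addr, shmlba):
    """SHM_RND behaviour: round addr down to a multiple of shmlba (a power of two)."""
    return addr & ~(shmlba - 1)


def rounds_to_nil_page(addr, shmlba):
    """True when a non-zero requested address ends up as address 0."""
    return addr != 0 and shm_round_down(addr, shmlba) == 0
```

Any requested address below SHMLBA rounds to 0, which is how a caller that never asked for address 0 still ends up with a fixed mapping there.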

Passes shm related ltp tests.

Link: http://lkml.kernel.org/r/1486050195-18629-1-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Gareth Evans <gareth.evans@contextis.co.uk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 95e91b831f87ac8e1f8ed50c14d709089b4e01b8)

Orabug: 25717094
CVE-2017-5669

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
7 years ago Merge branch 'topic/uek-4.1/dtrace' into uek/uek-next
Chuck Anderson [Mon, 29 May 2017 19:53:39 +0000 (12:53 -0700)]
Merge branch 'topic/uek-4.1/dtrace' into uek/uek-next

* topic/uek-4.1/dtrace:
  dtrace: improve io provider coverage

7 years ago sg_write()/bsg_write() is not fit to be called under KERNEL_DS
Al Viro [Fri, 16 Dec 2016 18:42:06 +0000 (13:42 -0500)]
sg_write()/bsg_write() is not fit to be called under KERNEL_DS

Orabug: 25340071
CVE: CVE-2016-10088

Both damn things interpret userland pointers embedded into the payload;
worse, they are actually traversing those.  Leaving aside the bad
API design, this is very much _not_ safe to call with KERNEL_DS.
Bail out early if that happens.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 128394eff343fc6d2f32172f03e24829539c5835)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years ago tcp: fix potential memory corruption
Eric Dumazet [Wed, 2 Nov 2016 14:53:17 +0000 (07:53 -0700)]
tcp: fix potential memory corruption

Imagine initial value of max_skb_frags is 17, and last
skb in write queue has 15 frags.

Then max_skb_frags is lowered to 14 or a smaller value.

tcp_sendmsg() will then be allowed to add additional page frags
and eventually go past MAX_SKB_FRAGS, overflowing struct
skb_shared_info.
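
One way such an overflow can arise is a limit test written as an equality. That is an assumption made for this illustration, not something the message states, and the function names below are hypothetical. The Python sketch compares an equality test with a "reached or passed" test using the numbers from the scenario above.

```python
def can_add_frag_eq(nr_frags, max_frags):
    """Stops only when the count equals the limit (misses counts above it)."""
    return nr_frags != max_frags


def can_add_frag_ge(nr_frags, max_frags):
    """Stops whenever the count has reached or passed the limit."""
    return nr_frags < max_frags
```

With 15 frags already attached and the limit lowered to 14, the equality form still allows another frag, while the second form refuses it.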

Orabug: 25140382

Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Cc: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ac9e70b17ecd7c6e933ff2eaf7ab37429e71bf4d)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
7 years ago block: fix use-after-free in seq file
Vegard Nossum [Fri, 29 Jul 2016 08:40:31 +0000 (10:40 +0200)]
block: fix use-after-free in seq file

Orabug: 25134541
CVE: CVE-2016-7910

I got a KASAN report of use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
    Read of size 8 by task trinity-c1/315
    =============================================================================
    BUG kmalloc-32 (Not tainted): kasan: bad access detected
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
            ___slab_alloc+0x4f1/0x520
            __slab_alloc.isra.58+0x56/0x80
            kmem_cache_alloc_trace+0x260/0x2a0
            disk_seqf_start+0x66/0x110
            traverse+0x176/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a
    INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
            __slab_free+0x17a/0x2c0
            kfree+0x20a/0x220
            disk_seqf_stop+0x42/0x50
            traverse+0x3b5/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a

    CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G    B           4.7.0+ #62
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
     ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
     ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
     ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
    Call Trace:
     [<ffffffff81d6ce81>] dump_stack+0x65/0x84
     [<ffffffff8146c7bd>] print_trailer+0x10d/0x1a0
     [<ffffffff814704ff>] object_err+0x2f/0x40
     [<ffffffff814754d1>] kasan_report_error+0x221/0x520
     [<ffffffff8147590e>] __asan_report_load8_noabort+0x3e/0x40
     [<ffffffff83888161>] klist_iter_exit+0x61/0x70
     [<ffffffff82404389>] class_dev_iter_exit+0x9/0x10
     [<ffffffff81d2e8ea>] disk_seqf_stop+0x3a/0x50
     [<ffffffff8151f812>] seq_read+0x4b2/0x11a0
     [<ffffffff815f8fdc>] proc_reg_read+0xbc/0x180
     [<ffffffff814b24e4>] do_loop_readv_writev+0x134/0x210
     [<ffffffff814b4c45>] do_readv_writev+0x565/0x660
     [<ffffffff814b8a17>] vfs_readv+0x67/0xa0
     [<ffffffff814b8de6>] do_preadv+0x126/0x170
     [<ffffffff814b92ec>] SyS_preadv+0xc/0x10

This problem can occur in the following situation:

open()
 - pread()
    - .seq_start()
       - iter = kmalloc() // succeeds
       - seqf->private = iter
    - .seq_stop()
       - kfree(seqf->private)
 - pread()
    - .seq_start()
       - iter = kmalloc() // fails
    - .seq_stop()
       - class_dev_iter_exit(seqf->private) // boom! old pointer

As the comment in disk_seqf_stop() says, stop is called even if start
failed, so we need to reinitialise the private pointer to NULL when seq
iteration stops.

An alternative would be to set the private pointer to NULL when the
kmalloc() in disk_seqf_start() fails.
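
A user-space sketch of the sequence above, assuming a stand-in structure and helpers (struct seq_model, model_start(), model_stop() and fail_next_alloc are hypothetical names, and malloc()/free() replace kmalloc()/kfree()). It illustrates only the idea of clearing the private pointer in stop, not the actual kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct seq_file; only ->private matters here. */
struct seq_model {
	void *private;
};

/* Set to nonzero to make the next allocation in model_start() fail. */
static int fail_next_alloc;

/* Models .seq_start(): allocate an iterator and publish it in ->private. */
static void *model_start(struct seq_model *seqf)
{
	void *iter;

	assert(seqf != NULL);
	iter = fail_next_alloc ? NULL : malloc(32);
	fail_next_alloc = 0;
	if (!iter)
		return NULL;	/* ->private is left untouched on failure */
	seqf->private = iter;
	return iter;
}

/* Models .seq_stop(): runs even when the preceding start failed. */
static void model_stop(struct seq_model *seqf)
{
	void *iter;

	assert(seqf != NULL);
	iter = seqf->private;
	if (iter) {
		free(iter);
		seqf->private = NULL;	/* do not leave a stale pointer behind */
	}
}
```

Without the assignment to NULL, a second model_stop() after a failed model_start() would free the old pointer again, which mirrors the "boom" step in the trace above.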

Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 77da160530dd1dc94f6ae15a981f24e5f0021e84)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
7 years ago xfs: Correctly lock inode when removing suid and file capabilities
Jan Kara [Thu, 21 May 2015 14:05:56 +0000 (16:05 +0200)]
xfs: Correctly lock inode when removing suid and file capabilities

From a6de82cab123beaf9406024943caa0242f0618b0 Mon Sep 17 00:00:00 2001

Currently XFS calls file_remove_privs() without holding i_mutex. This is
wrong because that function can end up messing with file permissions and
file capabilities stored in xattrs for which we need i_mutex held.

Fix the problem by grabbing iolock exclusively when we will need to
change anything in permissions / xattrs.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533
Signed-off-by: darrick.wong@oracle.com
7 years ago fs: Call security_ops->inode_killpriv on truncate
Jan Kara [Thu, 21 May 2015 14:05:55 +0000 (16:05 +0200)]
fs: Call security_ops->inode_killpriv on truncate

From 45f147a1bc97c743c6101a8d2741c69a51f583e4 Mon Sep 17 00:00:00 2001

The comment in include/linux/security.h says that ->inode_killpriv()
should be called when the setuid bit is being removed and that similar
security labels (in fact this applies only to file capabilities) should
be removed at this time as well. However, we don't call
->inode_killpriv() when we remove the suid bit on truncate.

We fix the problem by calling ->inode_need_killpriv() and subsequently
->inode_killpriv() on truncate the same way as we do it on file write.

After this patch there's only one user of should_remove_suid() - ocfs2 -
and indeed it's buggy because it doesn't call ->inode_killpriv() on
write. However fixing it is difficult because of special locking
constraints.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533
Signed-off-by: darrick.wong@oracle.com
7 years ago fs: Provide function telling whether file_remove_privs() will do anything
Jan Kara [Thu, 21 May 2015 14:05:54 +0000 (16:05 +0200)]
fs: Provide function telling whether file_remove_privs() will do anything

From dbfae0cdcd87602737101d4417811f4323156b54 Mon Sep 17 00:00:00 2001

Provide function telling whether file_remove_privs() will do anything.
Currently we only have should_remove_suid() and that does something
slightly different.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533
Signed-off-by: darrick.wong@oracle.com
7 years ago fs: Rename file_remove_suid() to file_remove_privs()
Jan Kara [Thu, 21 May 2015 14:05:53 +0000 (16:05 +0200)]
fs: Rename file_remove_suid() to file_remove_privs()

From 5fa8e0a1c6a762857ae67d1628c58b9a02362003 Mon Sep 17 00:00:00 2001

file_remove_suid() is a misnomer since it also removes file capabilities
stored in xattrs and sets the S_NOSEC flag. Also, should_remove_suid()
reports something other than whether a file_remove_suid() call is
necessary, which leads to bugs.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533
Signed-off-by: darrick.wong@oracle.com
7 years ago IB/uverbs: Fix leak of XRC target QPs
Tariq Toukan [Thu, 27 Oct 2016 13:36:26 +0000 (16:36 +0300)]
IB/uverbs: Fix leak of XRC target QPs

The real QP is destroyed once its ref count reaches zero, but for
XRC target QPs this destroy call was missing, which leaked QPs.

Call destroy for all flows.

Orabug: 24761732

Fixes: 0e0ec7e0638e ('RDMA/core: Export ib_open_qp() to share XRC...')
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 5b810a242c28e1d8d64d718cebe75b79d86a0b2d)
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
7 years ago Some unsupported ioctls get logged unnecessarily
Venkat Venkatsubra [Tue, 20 Dec 2016 19:55:39 +0000 (11:55 -0800)]
Some unsupported ioctls get logged unnecessarily

IPoIB logs messages such as "ib0: ioctl fail to copy request data".

Orabug: 24510137

Acked-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
7 years ago IB/ipoib: Expose acl_enable sysfs file as read only
Yuval Shaia [Sun, 30 Apr 2017 04:13:50 +0000 (00:13 -0400)]
IB/ipoib: Expose acl_enable sysfs file as read only

This file can be used to determine whether ipoib supports IB-ACL.
In debug mode all sysfs files are exposed in full mode.
In non-debug mode only acl_enable is exposed, and only in read-only mode.

Orabug: 25993951

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Mukesh Kacker <mukesh.kacker@oracle.com>