www.infradead.org Git - users/jedix/linux-maple.git/log

block: loop: Enable directIO whenever possible

The patches intended for mainline required a user-space change to
losetup setting O_DIRECT on the backing file. We avoid this for UEK.

I also made some changes to keep lo->lo_flags (LO_FLAGS_DIRECT_IO) and
lo->use_dio in sync.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
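
The flag/field synchronization mentioned in the commit above can be pictured with a small toy model. This is only a sketch: the LoopDevice class, the set_dio() helper and the backing_supports_dio argument are hypothetical names, and the value 16 for LO_FLAGS_DIRECT_IO is taken from mainline's uapi <linux/loop.h> and assumed to match this tree.

```python
from dataclasses import dataclass

# Assumed value, as defined in mainline's uapi <linux/loop.h>.
LO_FLAGS_DIRECT_IO = 16


@dataclass
class LoopDevice:
    """Hypothetical stand-in for the two fields the commit keeps in sync."""
    lo_flags: int = 0
    use_dio: bool = False


def set_dio(lo: LoopDevice, want_dio: bool, backing_supports_dio: bool) -> None:
    """Update use_dio and the matching flag bit in one place."""
    # "Whenever possible": direct I/O is only used if the backing file allows it.
    lo.use_dio = want_dio and backing_supports_dio
    if lo.use_dio:
        lo.lo_flags |= LO_FLAGS_DIRECT_IO
    else:
        lo.lo_flags &= ~LO_FLAGS_DIRECT_IO
```

Routing every change through one helper is what prevents the flag bit and the boolean from drifting apart.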

Merge branch 'topic/uek-4.1/aio-dio' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/aio-dio' of git://ca-git.us.oracle.com/linux-uek:
  block: loop: support DIO & AIO
  block: loop: prepare for supporing direct IO
  block: loop: use kthread_work
  block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop
  fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
  nfs: don't dirty kernel pages read by direct-io
  block: loop: avoiding too many pending per work I/O
  block: loop: convert to per-device workqueue

Merge branch 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek:
  uek-rpm: configs: Enabel Oracle HXGE and ASM driver
  uek-rpm: build: Add rpm build environment for ol6/ol7
  uek-rpm: configs: Create baseline config for uek4[ol6/ol7]

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek: (24 commits)
  megaraid_sas: Permit large RAID0/1 requests
  megaraid_sas : Modify return value of megasas_issue_blocked_cmd() and wait_and_poll() to consider command status returned by firmware
  megaraid_sas : swap whole register in megasas_register_aen
  megaraid_sas : fix megasas_fire_cmd_fusion calling convention
  megaraid_sas : add missing byte swaps to the sriov code
  megaraid_sas : bytewise or should be done on native endian variables
  megaraid_sas : move endianness conversion into caller of megasas_get_seq_num
  megaraid_sas : add endianness conversions for all ones
  megaraid_sas : add endianness annotations
  megaraid_sas : add missing __iomem annotations
  megaraid_sas : megasas_complete_outstanding_ioctls() can be static
  megaraid_sas : Support for Avago's Single server High Availability product
  megaraid_sas : Add release date and update driver version
  megaraid_sas : Modify driver's meta data to reflect Avago
  megaraid_sas : Use Block layer tag support for internal command indexing
  megaraid_sas : Enhanced few prints
  megaraid_sas : Move controller's queue depth calculation in adapter specific function
  megaraid_sas : Add separate functions for building sysPD IOs and non RW LDIOs
  megaraid_sas : Add separate function for refiring MFI commands
  megaraid_sas : Add separate function for setting up IRQs
  ...

Merge branch 'topic/uek-4.1/oracleasm' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/oracleasm' of git://ca-git.us.oracle.com/linux-uek:
  oracleasm: Fix trace output for warn_asm_ioc and check_asm_ioc
  oracleasm: Fix occasional I/O stall due to merge error
  oracleasm: Classify device connectivity issues as global errors
  oracleasm: Deprecate mlog and implement support for tracepoints
  oracleasm: Abolish mlog usage in integrity.c and clean up error printing.
  oracleasm: Various code and whitespace cleanups.
  oracleasm: 4.0 compat changes
  oracleasm: Compat changes for 3.18
  oracleasm: claim FMODE_EXCL access on disk during asm_open
  oracleasm: Restrict logical block size reporting
  oracleasm: Report logical block size
  oracleasm: Compat changes for 3.10
  oracleasm: Add support for new error return codes from block/SCSI
  oracleasm: Compat changes for 3.8
  oracleasm: Compat changes for 3.5
  oracleasm: Introduce module parameter for block size selection
  oracleasm: Data integrity support
  oracleasm: Fix two merge errors
  Oracle ASM Kernel Driver

Merge branch 'topic/uek-4.1/fuse' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/fuse' of git://ca-git.us.oracle.com/linux-uek:
  fuse: fix typo while displaying fuse numa mount option
  fuse: add numa mount option
  fuse: modify queues, allocation and locking for multiple nodes
  fuse: add spinlock to protect fc reqctr
  fuse: add fuse node struct

Merge branch 'topic/uek-4.1/ocfs2' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/ocfs2' of git://ca-git.us.oracle.com/linux-uek:
  ocfs2: call ocfs2_journal_access_di() before ocfs2_journal_dirty() in ocfs2_write_end_nolock()
  ocfs2: avoid access invalid address when read o2dlm debug messages
  ocfs2: make 'buffered' as the default coherency option
  ocfs2: Suppress the error message from being printed in ocfs2_rename
  ocfs2: Tighten free bit calculation in the global bitmap
  ocfs2/trivial: Limit unaligned aio+dio write messages to once per day
  ocfs2/trivial: Print message indicating unaligned aio+dio write

Merge branch 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek:
  cdc-acm: Increase number of devices to 64
  ipmi: make kcs timeout parameters as module options
  x86: perf: prevent spurious PMU NMIs on Haswell systems
  x86/simplefb: simplefb was broken on Oracle and HP system, skip VIDEO_TYPE_EFI
  x86, fpu: Avoid possible error in math_state_restore()
  kernel: freezer: restore TIF_FREEZE
  ksplice: Clear garbage data on the kernel stack when handling signals
  sched: Disable default sched_autogroup to avoid the DBA performance regression
  x86: add support for crashkernel=auto

Merge branch 'topic/uek-4.1/xen' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/xen' of git://ca-git.us.oracle.com/linux-uek:
  xen/microcode: Use dummy microcode_ops for non initial domain guest
  xen/microcode: Fix compile warning.
  microcode_xen: Add support for AMD family >= 15h
  x86/microcode: check proper return code.
  xen: add CPU microcode update driver
  x86/xen: Disable APIC PM for Xen PV guests
  xen/pvhvm: Support more than 32 VCPUs when migrating (v3).

ocfs2: call ocfs2_journal_access_di() before ocfs2_journal_dirty() in ocfs2_write_end_nolock()

After ocfs2_journal_access_di() is called in ocfs2_write_begin(),
jbd2_journal_restart() may also be called. That function decrements
transaction A's t_updates and obtains a new transaction B.  If
jbd2_journal_commit_transaction() happens to be committing transaction A,
then once t_updates==0 it goes on to complete the commit and unfile the buffer.

So when jbd2_journal_dirty_metadata() is called, the handle points to the new
transaction B and the buffer head's journal head has already been freed
(jh->b_transaction == NULL, jh->b_next_transaction == NULL), so it returns
EINVAL, which triggers the BUG_ON(status).

thread 1:                             jbd2:
ocfs2_write_begin                     jbd2_journal_commit_transaction
ocfs2_write_begin_nolock
  ocfs2_start_trans
    jbd2__journal_start(t_updates+1,
                       transaction A)
    ocfs2_journal_access_di
    ocfs2_write_cluster_by_desc
      ocfs2_mark_extent_written
        ocfs2_change_extent_flag
          ocfs2_split_extent
            ocfs2_extend_rotate_transaction
              jbd2_journal_restart
              (t_updates-1,transaction B) t_updates==0
                                        __jbd2_journal_refile_buffer

ocfs2_write_end
ocfs2_write_end_nolock
    ocfs2_journal_dirty
        jbd2_journal_dirty_metadata(bug)
   ocfs2_commit_trans

In ext4, I found that jbd2_journal_get_write_access() is called by
ext4_write_end:

ext4_write_begin
    ext4_journal_start
        __ext4_journal_start_sb
            ext4_journal_check_start
            jbd2__journal_start

ext4_write_end
    ext4_mark_inode_dirty
        ext4_reserve_inode_write
            ext4_journal_get_write_access
                jbd2_journal_get_write_access
        ext4_mark_iloc_dirty
            ext4_do_update_inode
                ext4_handle_dirty_metadata
                    jbd2_journal_dirty_metadata

So I think we should call ocfs2_journal_access_di() before
ocfs2_journal_dirty() in ocfs2_write_end(). It works well after my
modification.

Signed-off-by: vicky <vicky.yangwenfang@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 19bf7feab124221625b5c811b6192fff4e0cbb96)
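
The race described in the commit above can be replayed with a toy model. This is a sketch under simplifying assumptions, not the jbd2 implementation: the classes and helpers are hypothetical, and journal_dirty() only checks b_transaction, whereas the real check also looks at b_next_transaction.

```python
EINVAL = 22


class Buffer:
    def __init__(self):
        self.b_transaction = None  # transaction the buffer is filed on


class Transaction:
    def __init__(self, name):
        self.name = name
        self.t_updates = 0
        self.buffers = []


class Handle:
    def __init__(self, transaction):
        self.transaction = transaction
        transaction.t_updates += 1


def journal_access(handle, buf):
    """File the buffer on the handle's current transaction."""
    buf.b_transaction = handle.transaction
    if buf not in handle.transaction.buffers:
        handle.transaction.buffers.append(buf)


def journal_restart(handle, new_transaction):
    """Drop the handle from the old transaction and attach it to a new one."""
    handle.transaction.t_updates -= 1
    handle.transaction = new_transaction
    new_transaction.t_updates += 1


def commit(transaction):
    """Commit only when no handle is updating; unfile all buffers."""
    if transaction.t_updates != 0:
        return False
    for buf in transaction.buffers:
        buf.b_transaction = None
    transaction.buffers.clear()
    return True


def journal_dirty(handle, buf):
    """Fail if the buffer is not filed on the handle's transaction."""
    return 0 if buf.b_transaction is handle.transaction else -EINVAL


def write_path(reaccess_in_write_end):
    a, b = Transaction("A"), Transaction("B")
    di_bh = Buffer()
    handle = Handle(a)                 # ocfs2_start_trans
    journal_access(handle, di_bh)      # access in ocfs2_write_begin
    journal_restart(handle, b)         # restart while rotating the extent tree
    commit(a)                          # jbd2 thread commits A, unfiles di_bh
    if reaccess_in_write_end:
        journal_access(handle, di_bh)  # the fix: access again in write_end
    return journal_dirty(handle, di_bh)
```

In this model write_path(False) reproduces the -EINVAL that trips the BUG_ON, while write_path(True) succeeds because the buffer is filed again on transaction B.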

ocfs2: avoid access invalid address when read o2dlm debug messages

The following case leads to a lockres being freed while it is still in use.

cat /sys/kernel/debug/o2dlm/locking_state       dlm_thread
lockres_seq_start
    -> lock dlm->track_lock
    -> get resA
                                                resA->refs decrease to 0,
                                                call dlm_lockres_release,
                                                and wait for "cat" unlock.
Although resA->refs is already set to 0,
increase resA->refs, and then unlock
                                                lock dlm->track_lock
                                                    -> list_del_init()
                                                    -> unlock
                                                    -> free resA

In such a race, an invalid address may be accessed.  So we should
remove the resource from the res->tracking list before resA->refs drops to 0.

Signed-off-by: jiangyiwen <jiangyiwen@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit e87e805fe4a1cf38031ae0669e3a91c8a8251279)
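
The ordering problem in the commit above can be shown with a sequential toy model. It is only a sketch: LockRes, reader_get() and race() are hypothetical names, and the single-threaded replay stands in for the real interleaving of "cat" and dlm_thread.

```python
class LockRes:
    def __init__(self):
        self.refs = 1
        self.freed = False


def reader_get(tracking):
    """Model of the debugfs reader: take a ref on the first tracked resource."""
    if not tracking:
        return None
    res = tracking[0]
    res.refs += 1
    return res


def race(fixed):
    """Return True if the reader ends up holding a freed resource."""
    res = LockRes()
    tracking = [res]
    if fixed:
        tracking.remove(res)     # fix: unlink before the last ref is dropped
    res.refs -= 1                # dlm_thread drops the last reference
    seen = reader_get(tracking)  # reader runs in the window before the free
    if not fixed:
        tracking.remove(res)     # old code unlinks only in the release path
    res.freed = True
    return seen is not None and seen.freed
```

With the old ordering the reader still finds the resource on the list after its count reached zero; with the new ordering the list is already empty.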

ocfs2: make 'buffered' as the default coherency option

Orabug: 17988729

Customers upgrading to UEK2 and above will see the default coherency option
set to 'full', which negatively impacts performance. This patch changes the
coherency option to 'buffered', which keeps the default behaviour the same as
before (UEK1). If an application that does direct I/O needs cache coherency,
it can use the mount option 'coherency=full'.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com>
(cherry picked from commit 020a20029508d2d7f36470bebd23f053de4b0dbe)

xen/microcode: Use dummy microcode_ops for non initial domain guest

Orabug: 19053626

Currently non-initial-domain guests use Intel- or AMD-specific ops.
This also slows down startup on heavily overcommitted guests (say 256 VCPUs
on 20 PCPUs), as there are many reads and writes to x86 MSR registers, which
trap to Xen during the microcode update. In the end it fails and reports errors.

A dummy ops could fix that and also make udevd silent (bug18379824)
by augmenting the commit c18a317f6892536851e5852b6aaa4ef42cbc11a2
"xen/microcode: Only load under initial domain." which fell short of its
intended fix.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 62b84234f23c1020c690d162b7d8250042425e1e)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
arch/x86/kernel/microcode_core.c

xen/microcode: Fix compile warning.

We get a bunch of them. Might as well fix it.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 3b99e95e07cee4bbe3ba2b511fc2ac38ff7769b9)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

microcode_xen: Add support for AMD family >= 15h

Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 8b080aa43b95719d8981ba06f357abb6f0ba9d52)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/microcode: check proper return code.

After pulling in this change from your tree, I found the following bug,
when checking an enum value, which should be considered before inclusion:

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 33a4651e09e0d6fb6e9c1293810d8a66b734840a)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: add CPU microcode update driver

Xen does all the hard work for us, including choosing the right update
method for this cpu type and actually doing it for all cpus. We just
need to supply it with the firmware blob.

Because Xen updates all CPUs (and the kernel's virtual cpu numbers have
no fixed relationship with the underlying physical cpus), we only bother
doing anything for cpu "0".

[ Impact: allow CPU microcode update in Xen dom0 ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Conflicts:
arch/x86/xen/Kconfig

(cherry picked from commit da3d1c83399886c443cbf9e57455bcc2e5caf28c)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
arch/x86/include/asm/microcode.h
arch/x86/kernel/Makefile
arch/x86/kernel/microcode_core.c
arch/x86/xen/Kconfig

x86/xen: Disable APIC PM for Xen PV guests

Xen PV guests support only a few APIC registers and writes to
unsupported registers result in WARN_ONs. Most APIC accesses in these
guests have been eliminated; however, lapic_suspend/resume are still
called (on 32-bit kernels).

We can disable APIC power management in xen_smp_prepare_boot_cpu()
(which is called after APIC has been initialized).

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pvhvm: Support more than 32 VCPUs when migrating (v3).

When Xen migrates an HVM guest, by default its shared_info can
only hold up to 32 CPUs. As such the hypercall
VCPUOP_register_vcpu_info was introduced which allowed us to
setup per-page areas for VCPUs. This means we can boot PVHVM
guest with more than 32 VCPUs. During migration the per-cpu
structure is allocated freshly by the hypervisor (vcpu_info_mfn
is set to INVALID_MFN) so that the newly migrated guest
can make a VCPUOP_register_vcpu_info hypercall.

Unfortunately we end up triggering this condition in Xen:
/* Run this command on yourself or on other offline VCPUS. */
if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )

which means we are unable to setup the per-cpu VCPU structures
for running vCPUS. The Linux PV code paths make this work by
iterating over every vCPU with:

1) is target CPU up (VCPUOP_is_up hypercall?)
2) if yes, then VCPUOP_down to pause it.
3) VCPUOP_register_vcpu_info
4) if it was down, then VCPUOP_up to bring it back up

But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
not allowed on HVM guests we can't do this. However with the
git commit XYZ ("hvm: Support more than 32 VCPUS when migrating.")
we can do this. As such first check if VCPUOP_is_up is actually
possible before trying this dance.

As most of this dance code is done already in 'xen_setup_vcpu'
let's make it callable on both PV and HVM. This means moving one
of the checks out to 'xen_setup_runstate_info'.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
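
The four-step sequence in the commit above can be sketched as a function that returns the hypercalls it would issue, in order. This is an illustration only: register_sequence() and its arguments are hypothetical, and step 4 is read here as "bring the vCPU back up if we took it down".

```python
def register_sequence(cpu_is_up, is_current_cpu, is_up_supported):
    """Return the ordered list of VCPUOP calls for one target vCPU."""
    ops = []
    paused = False
    # The current vCPU may register itself directly. For other vCPUs the
    # dance is only attempted if VCPUOP_is_up is usable for this guest type.
    if not is_current_cpu and is_up_supported:
        ops.append("VCPUOP_is_up")
        if cpu_is_up:
            ops.append("VCPUOP_down")
            paused = True
    ops.append("VCPUOP_register_vcpu_info")
    if paused:
        ops.append("VCPUOP_up")
    return ops
```

When is_up_supported is False for a running remote vCPU, the model issues the bare registration, which is the case the quoted Xen condition rejects.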

cdc-acm: Increase number of devices to 64

Orabug: 21219170

Increase usb acm devices to 64.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

ipmi: make kcs timeout parameters as module options

Orabug: 21219155

For slow or heavily used BMC controllers, the default wait timeouts for the IBF
or OBF bits in the driver may not be sufficient. This may cause problems during
more complicated OEM operations on the BMC side.
These timeouts are changed from hardcoded values in the code into kernel
module parameters. The default values are kept unchanged.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

x86: perf: prevent spurious PMU NMIs on Haswell systems

Orabug: 20996846

When "perf" is run on Haswell-based systems under UEK3, we've
noticed that "extra" NMIs are being generated by the Performance
Monitoring Unit (PMU).

The PMU contains counters that can count occurrences of certain
kinds of events, such as branch misses or instructions retired.
These counters can be programmed to issue an interrupt when they
reach certain pre-set values. Linux uses vector 2, the NMI vector,
for these interrupts, so the PMU interrupts behave just like other
sources of NMIs such as watchdog timers. Each consumer of NMIs
within the kernel is responsible for identifying the interrupts
it's interested in.

In the current case, the Linux PMU-support code is failing to
"claim" certain of the NMIs that are originating in the PMU.
What happens when no piece of kernel code claims an NMI is that
an ugly kernel message gets generated and, if the sysctl variable
"unknown_nmi_panic" is set nonzero (as it is by default on Exadata
systems), the system panics.

The current UEK3 PMU handler attempts to determine whether a given
NMI belongs to it by scanning the PMU hardware's potential NMI
sources to find out whether any of them has triggered. Apparently,
Haswell has potential NMI sources that are indeed getting triggered,
but of which the PMU handler is not aware.

This commit contains two measures designed to prevent these
extra NMIs.

First, we've moved the write to the local APIC's APIC_LVTPC register
from near the beginning of the PMU NMI handler to near the end.
Upstream has discovered empirically that this helps eliminate
the spurious NMIs.  See:

    http://lists.openwall.net/linux-kernel/2013/06/19/712

for the original commit.

Second, this change takes advantage of a bit in the APIC_LVTPC
register that gets set when (and only when) a PMU-originated NMI
is being delivered to the CPU core.  This bit is a "mask" bit,
which when set, disables delivery of these NMIs to the core.
Having processed an NMI, system software must clear this bit in
order to enable delivery of the next one.

The fix involves sampling this bit and claiming the NMI if it's a
PMU NMI, even if its origin has not been otherwise determined.

Note that this change also helps render the PMU NMI handler immune
to the addition of more sources to the PMUs on future CPUs.

Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit ed921c01bcd2cad94dbd659ad2031a877e85acb8)

Conflict:

arch/x86/kernel/cpu/perf_event_intel.c

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
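
The second measure in the commit above, claiming an NMI when the LVTPC mask bit is set, can be sketched as follows. This is a toy model, not the handler itself: pmu_nmi_handled() and its arguments are hypothetical, while APIC_LVT_MASKED (bit 16) and APIC_DM_NMI (0x400) follow the kernel's APIC register definitions.

```python
APIC_LVT_MASKED = 1 << 16  # mask bit in an APIC LVT entry
APIC_DM_NMI = 0x400        # NMI delivery mode


def pmu_nmi_handled(sources_found, lvtpc_value):
    """Return how many events the PMU handler claims for this NMI.

    sources_found: number of known PMU sources that were seen to have fired.
    lvtpc_value:   value sampled from the APIC_LVTPC register.
    """
    handled = sources_found
    # If no known source fired but the mask bit is set, the NMI still
    # originated in the PMU, so claim it instead of leaving it "unknown".
    if handled == 0 and (lvtpc_value & APIC_LVT_MASKED):
        handled = 1
    return handled
```

Because the decision does not depend on enumerating every source, an NMI from a source the handler does not know about is still claimed.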

x86/simplefb: simplefb was broken on Oracle and HP system, skip VIDEO_TYPE_EFI

Orabug: 20961435

As described in https://bugzilla.kernel.org/show_bug.cgi?id=98721,
when kernel 4.0.4 was tested on Oracle and HP systems in UEFI mode, there was
no output and no login on the console.

Simplefb was broken on these systems when orig_video_isVGA is VIDEO_TYPE_EFI, so
skip it.

This patch was tested on Oracle Sun server X5-2 series and HP ProLiant DL380 Gen9
with kernel 4.0.4

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Tested-by: Kunlun Lao <kunlun.lao@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

x86, fpu: Avoid possible error in math_state_restore()

For non-eager fpu mode, thread's fpu state is allocated during the first
fpu usage (in the context of device not available exception). This can be
a blocking call and hence we enable interrupts (which were originally
disabled when the exception happened), allocate memory and disable
interrupts etc.

math_state_restore() is called from multiple places,
and it is error prone if the caller expects interrupts to be disabled
throughout the execution of math_state_restore(). This can lead to subtle
bugs like Ubuntu bug #1265841. So simplify the code that causes these subtle
bugs.

The patch has one known problem when the machine is running baremetal
(or PVHVM) and when there is low amount of memory. The problem is that the
applications won't get SIGKILL when the FPU area can't be allocated and
instead they will continue on running - without any FPU context
allocated for them. The 'init_fpu(tsk)' can return -ENOMEM and that
patch does not check that condition. This update will be tracked
in another bug since the patch already fixes two known issues related
to corruption.

Orabug: 20270524

Signed-off-by: Suresh Siddha <sbsiddha@gmail.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Annie Li <annie.li@oracle.com>
[santosh.shilimkar@oracle.com: Added FIXME comment]
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

kernel: freezer: restore TIF_FREEZE

Ksplice needs to freeze threads while in kernel. This facility was removed
from upstream since it was no longer required, but Ksplice still needs
it.

Re-add TIF_FREEZE to allow Ksplice to freeze threads.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

ksplice: Clear garbage data on the kernel stack when handling signals

The garbage data can give false-positives for the Ksplice safety checks
making it difficult (or sometimes impossible) to apply the rebootless
updates. Clear the garbage with 0-words to avoid this.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

sched: Disable default sched_autogroup to avoid the DBA performance regression

SCHED_AUTOGROUP optimizes the scheduler for common desktop workloads by
automatically creating and populating task groups. Though it helps CPU-hungry
desktop workloads (like build jobs), we found that it creates a 10% regression
in DBA performance.

Swingbench benchmark run for OLTP shows below:

@ UEK4-with-schedauto 3.18.4-5 7073
@ UEK4-without-schedauto 3.18.4-5 7873

So to have the best of both worlds, we make the SCHED_AUTOGROUP feature
available on UEK kernels but the default state is disabled.

One can enable it using the sysctl (kernel.sched_autogroup_enabled).

Orabug: 20476603

Tested-by: Thomas Tanaka <thomas.tanaka@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
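
As a quick check of the figures in the commit above, and of the current autogroup setting on a running system, a minimal sketch follows. The helper names are hypothetical; the /proc/sys path is the standard procfs location of the kernel.sched_autogroup_enabled sysctl.

```python
from pathlib import Path


def autogroup_state(path="/proc/sys/kernel/sched_autogroup_enabled"):
    """Return 0 or 1 from the sysctl file, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def regression_pct(baseline, measured):
    """Percentage drop of `measured` relative to `baseline`."""
    return 100.0 * (baseline - measured) / baseline


# Swingbench OLTP figures quoted in the commit message.
WITHOUT_AUTOGROUP = 7873
WITH_AUTOGROUP = 7073
```

Here regression_pct(WITHOUT_AUTOGROUP, WITH_AUTOGROUP) evaluates to roughly 10.2, consistent with the 10% stated in the commit message.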

x86: add support for crashkernel=auto

This patch adds support for "crashkernel=auto" and was backported from RHEL7.

Orabug: 20351819

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

uek-rpm: configs: Enabel Oracle HXGE and ASM driver

While doing that just sync up the config

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

uek-rpm: build: Add rpm build environment for ol6/ol7

Mostly imported from UEK3 uek-rpm environment with UEK4 related updates
and bug fixes.

Orabug: 20892775
Orabug: 21102340
Orabug: 20687425

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

uek-rpm: configs: Create baseline config for uek4[ol6/ol7]

The UEK4 kernel is based on kernel.org v4.1. Brief summary of the
new/added relevant kernel features in UEK4 (4.1) w.r.t. UEK3 (v3.8).
For more details, please refer to the attachments associated with the bugs
and the config files from the UEK git tree.

- Block Multi-Queue support
- Low Latency Socket Poll
- Full Dynamic Tick
- SLUB Memory Allocator as a default
- Networking performance Improvements [Bulk network packet transmission]
- XEN Related updates: (PV Ticket locks, pvSCSI, Xen PVH guest and Xen-netback)
- SPARC architecture updates
- Automatic NUMA balancing turned ON default.
- CGROUP improvements
- CONFIG_SCHED_STATS is turned OFF from production kernels
- TRANSPARENT_HUGEPAGE enabled with IO problem being addressed in v3.18
- CMA allows to allocate big physically-contiguous blocks of memory.
- Data Center TCP (DCTCP), though it needs RFC 3168 (ECN) support
- Open vSwitch support with GRE, VXLAN and GENEVE tunneling support
- nftables, the successor of iptables
- NFSv4.2 client support, NFSv4.1 client support for migration and Ceph client caching support.
- FOO over UDP and Virtual (secure) IPv6: tunneling
- Zswap and Bcache support.
- BTRFS improvements
- X86_INTEL_MPX
- X86_VSYSCALL_EMULATION
- NET_DSA_HWMON
- NET_FOU_IP_TUNNELS
- NET_SWITCHDEV
- IPVLAN
- BLK_DEV_RAM_DAX and FS_DAX
- I40E_FCOE
- NFSD_PNFS
- DM Multipath and MQ conversions
- Very basic support for PMEM

Config related bugs with fixes part of base config
Orabug: 20064118
Orabug: 20343801
Orabug: 20343138
Orabug: 20064118
Orabug: 20064118
Orabug: 20064118
Orabug: 20064118
Orabug: 20473608
Orabug: 20516347
Orabug: 20611390
Orabug: 21233074
Orabug: 20687425

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

oracleasm: Fix trace output for warn_asm_ioc and check_asm_ioc

The trace logic transposed the warn_asm_ioc and check_asm_ioc values. We
would treat the former as a flag and the latter as an integer. Fix this
so we print warning error codes and integrity buffer presence correctly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Fix occasional I/O stall due to merge error

Commit c05b6f12aae5 (oracleasm: Deprecate mlog and implement support for
tracepoints) inadvertently changed the maybe_wait_io logic so that we
would occasionally hang while waiting for I/O completion. Make sure we
only return when there is an actual error.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Classify device connectivity issues as global errors

We used to set the ASM_LOCAL_ERROR qualifier when we got ENOLINK, EBADE
or ENODEV status from the storage stack. The assumption was that the
error could be caused by a pulled cable or a bad switch port and that
other nodes in a cluster might still have access to the storage.

The ASM team would prefer these types of errors to be treated as global,
however, as this would be consistent with database behavior when ASMLIB
is not in the picture.

Remove the ASM_LOCAL_ERROR flag from the device connectivity error code
path.

Orabug: 20117903

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Deprecate mlog and implement support for tracepoints

- Replace mlog_bug_on_msg() with BUG_ON()

- Remove mlog_entry() and mlog_exit() calls

- Introduce tracing to replace the important mlog() calls. The
following tracepoints are available:

oracleasm:disk Disk setup and teardown
oracleasm:req Internal request setup and teardown
oracleasm:bio Bios submitted to the I/O stack
oracleasm:ioc I/O descriptors from the RDBMS
oracleasm:integrity Data integrity payload setup
oracleasm:querydisk Disk properties

- Remove masklog.* and proc filesystem registration code

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
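
Assuming the names listed in the commit above follow the usual "subsystem:event" convention, each tracepoint would map to an enable file under tracefs. The sketch below only builds that path; event_enable_path() is a hypothetical helper, and the tracefs mount point is an assumption that varies between systems.

```python
def event_enable_path(tracepoint, tracefs="/sys/kernel/debug/tracing"):
    """Map 'subsystem:event' to its tracefs enable file."""
    system, event = tracepoint.split(":", 1)
    return f"{tracefs}/events/{system}/{event}/enable"


# Tracepoint names as listed in the commit message.
ORACLEASM_TRACEPOINTS = (
    "oracleasm:disk",
    "oracleasm:req",
    "oracleasm:bio",
    "oracleasm:ioc",
    "oracleasm:integrity",
    "oracleasm:querydisk",
)
```

Writing "1" to such a file (as root) is the usual way to enable a trace event; whether these exact paths exist depends on the driver build.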

oracleasm: Abolish mlog usage in integrity.c and clean up error printing.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Various code and whitespace cleanups.

No functional changes.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: 4.0 compat changes

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Compat changes for 3.18

- file->f_dentry is now file->f_path.dentry

- bio and bip iterators replace sector and size values

- Post 3.18 integrity flags

- Add error injection flags and disk type flags

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: claim FMODE_EXCL access on disk during asm_open

Orabug: 19454829

asm_open_disk should take exclusive access on asm disk during open to
prevent it from getting deleted while in use.

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Restrict logical block size reporting

Hanlin pointed out that we should only report the additional logical
block size when we are in physical block size mode. When the
"use_logical_block_size" module parameter is in use we should revert to
the old behavior.

Reported-by: Hanlin Chien <hanlin.qian@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Report logical block size

Report the device's logical block size in the qd_feature variable. This
allows ASM to determine whether a disk group can be imported should a
storage device change its physical block size reporting.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Compat changes for 3.10

create_proc_entry() has been deprecated in favor of proc_create().

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Add support for new error return codes from block/SCSI

Make sure we correctly handle the additional error codes returned by the
I/O stack.

Orabug: 17484923

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Compat changes for 3.8

Update oracleasm driver to accommodate the VFS layer changes in recent
3.x kernels.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Compat changes for 3.5

The inode's i_uid field was converted to a kuid_t type to support user
namespaces in kernel commit 92361636e. Change initializer type.

Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

oracleasm: Introduce module parameter for block size selection

Orabug: 15924773
We have encountered a few devices which after a firmware update
communicate different characteristics to the OS. In particular, some
devices begin to report their physical block size. This in turn will
cause oracleasm to report a different block size to ASM and mounting the
disk group will fail.

Introduce a module parameter which permits the logical block size to be
reported instead of the physical.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 0d81a777b246ef0edbc37a7d6ca23fa14fd8fe79)

oracleasm: Data integrity support

Add data integrity support to the oracleasm driver.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b25990f3ecd4c588b048dc04a6ccafbbcbe3b36a)

oracleasm: Fix two merge errors

Fix two bugs introduced while bringing oracleasm up to date with
mainline.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit faf0fb8aecf08f8abaf5972af888a1b17af7d7d3)

Oracle ASM Kernel Driver

Include version 2.0.7 of oracleasm.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

block: loop: support DIO & AIO

There are about three advantages to using direct I/O and AIO to
read and write the loop device's backing file:

1) double caching is avoided, so memory usage drops
considerably

2) unlike user-space direct I/O, there is no cost of
pinning pages

3) context switches are avoided while still obtaining good throughput
- with buffered file reads, top random I/O throughput is often obtained
only if requests are submitted concurrently from many tasks; but
sequential I/O is mostly served from the page cache, so
concurrent submissions often introduce unnecessary context switches
and can't improve throughput much. There was a discussion[1]
about using non-blocking I/O to mitigate the problem for applications.
- with direct I/O and AIO, concurrent submissions can be
avoided without hurting random read throughput

My fio test results follow:

1. 16 jobs fio test inside ext4 file system over loop block
1) How to run
- linux kernel: 4.1.0-rc2-next-20150506 with the patchset
- the loop block is over one image on HDD.
- linux psync, 16 jobs, size 400M, ext4 over loop block
- test result: IOPS from fio output

2) Throughput result:
        -------------------------------------------------------------
        test cases          |randread   |read   |randwrite  |write  |
        -------------------------------------------------------------
        base                |240        |8705   |3763       |20914  |
        -------------------------------------------------------------
        base+loop aio       |242        |9258   |4577       |21451  |
        -------------------------------------------------------------
3) context switch
- context switches decreased by ~16% with loop aio for randread,
and by ~33% for read

4) memory usage
- After these four tests with loop aio: ~10% of memory ends up in use
- After these four tests without loop aio: more than 55% of memory
ends up in use

2. single job fio test inside ext4 file system over loop block(for Maxim Patlasov)
1) How to run
- linux kernel: 4.1.0-rc2-next-20150506 with the patchset
- the loop block is over one image on HDD.
- linux psync, 1 job, size 4000M, ext4 over loop block
- test result: IOPS from fio output

2) Throughput result:
        -------------------------------------------------------------
        test cases          |randread   |read   |randwrite  |write  |
        -------------------------------------------------------------
        base                |109        |21180  |4192       |22782  |
        -------------------------------------------------------------
        base+loop aio       |114        |21018  |5404       |22670  |
        -------------------------------------------------------------
3) context switch
- context switches decreased by ~10% with loop aio for randread,
and by ~50% for read

4) memory usage
- After these four tests with loop aio: ~10% of memory ends up in use
- After these four tests without loop aio: more than 55% of memory
ends up in use

Both the 'context switch' and 'memory usage' data were collected with sar.
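
The IOPS tables above can be turned into relative changes with a few lines of arithmetic. The following is a minimal sketch rather than part of the patch: the numbers are copied from the two tables, while the names `RESULTS`, `percent_change` and `summarize` are made up for illustration.

```python
# IOPS figures copied from the two tables above: (base, base + loop aio).
RESULTS = {
    "16 jobs": {
        "randread": (240, 242),
        "read": (8705, 9258),
        "randwrite": (3763, 4577),
        "write": (20914, 21451),
    },
    "1 job": {
        "randread": (109, 114),
        "read": (21180, 21018),
        "randwrite": (4192, 5404),
        "write": (22782, 22670),
    },
}


def percent_change(base, new):
    """Relative change of new against base, in percent."""
    return (new - base) / base * 100.0


def summarize(results):
    """Map each run and workload to its IOPS change in percent."""
    return {
        run: {name: percent_change(*pair) for name, pair in cases.items()}
        for run, cases in results.items()
    }


if __name__ == "__main__":
    for run, cases in summarize(RESULTS).items():
        for name, delta in cases.items():
            print(f"{run:8s} {name:10s} {delta:+6.1f}%")
```

Read this way, randwrite gains the most in both runs, while single-job sequential read and write stay within about one percent of the baseline.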

[1] https://lwn.net/Articles/612483/
[2] sar graph when running fio over loop without the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-nonaio.pdf

[3] sar graph when running fio over loop with the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-aio.pdf

[4] sar graph when running fio over loop without the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-nonaio-1job.pdf

[5] sar graph when running fio over loop with the patchset
http://kernel.ubuntu.com/~ming/block/loop-aio/v3/lo-aio-1job.pdf

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>

block: loop: prepare for supporing direct IO

This patch provides one interface for enabling direct IO
from user space:

- userspace (such as losetup) can pass a 'file' that was
opened with O_DIRECT or had O_DIRECT set via fcntl

Also, __loop_update_dio() is introduced to check whether direct I/O
can be used with the current loop settings.

The last big change is the introduction of the LO_FLAGS_DIRECT_IO flag,
which lets userspace know whether direct IO is used to access the
backing file.
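
As a rough user-space illustration of the "opened/fcntl as O_DIRECT" step, the sketch below sets O_DIRECT on an already open file descriptor and checks whether the flag stuck. It assumes Linux; the helper names `is_direct` and `try_enable_direct` are hypothetical, and handing the descriptor to the loop driver (which needs root) is deliberately left out.

```python
import fcntl
import os
import tempfile

# O_DIRECT is Linux-specific; fall back to 0 where the constant is missing.
O_DIRECT = getattr(os, "O_DIRECT", 0)


def is_direct(fd):
    """Return True if the open file description has O_DIRECT set."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    return bool(O_DIRECT) and bool(flags & O_DIRECT)


def try_enable_direct(fd):
    """Try to set O_DIRECT with fcntl(F_SETFL); report whether it stuck."""
    if not O_DIRECT:
        return False
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    try:
        fcntl.fcntl(fd, fcntl.F_SETFL, flags | O_DIRECT)
    except OSError:
        # Some filesystems reject O_DIRECT on their files.
        return False
    return is_direct(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as backing:
        fd = backing.fileno()
        print("O_DIRECT before:", is_direct(fd))
        print("O_DIRECT enabled:", try_enable_direct(fd))
```

Whether the flag can be set depends on the filesystem holding the file, which is why the helper reports the outcome instead of assuming success.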

Cc: linux-api@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>

block: loop: use kthread_work

The following patch will use dio/aio to submit I/O to the backing file,
so I/O no longer needs to be scheduled concurrently from work items;
use kthread_work to reduce context switch cost considerably.

For the non-AIO case, a single thread had been used for a very long time;
it was only converted to work items in v4.0, which has already caused a
performance regression for Fedora live booting. In the discussion[1], it
was noted that although submitting I/O concurrently via work items can
improve random read throughput, it might hurt sequential read
performance, so it is better to restore the single-thread behaviour.

For the following AIO support, if loop ever has such high performance
requirements, it would be better to use multiple hw-queues with a
per-hwq kthread than the current work approach.

[1] http://marc.info/?t=143082678400002&r=1&w=2
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>

block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop

It doesn't make sense to enable merge because the I/O
submitted to backing file is handled page by page.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>

fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read

When direct read IO is submitted from the kernel, it is often
unnecessary to dirty pages. In the loop case, for example, dirtying
pages has already been handled on the upper filesystem (over loop)
side, so they don't need to be dirtied again.

So this patch stops dirtying pages for ITER_BVEC/ITER_KVEC
direct reads; loop should be the first user of ITER_BVEC/ITER_KVEC
for direct read I/O.

The patch is based on Dave's earlier patch.

Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>

nfs: don't dirty kernel pages read by direct-io

Replicate the logic in the commit:
fs/direct-io: introduce should_dirty for kernel aio

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

block: loop: avoiding too many pending per work I/O

If there are too many pending per-work I/Os, too many
high-priority worker threads can be created, which can
hurt system performance.

This patch limits the workqueue's max_active parameter to 16.
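
The effect of capping the number of in-flight work items can be sketched in user space. This is only an analogy and not the kernel workqueue API: a thread pool with 16 workers stands in for a workqueue with max_active set to 16, and `run_bounded` is a hypothetical helper that records the peak concurrency.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_ACTIVE = 16  # the limit stated in the commit message


def run_bounded(num_items, max_active=MAX_ACTIVE):
    """Run num_items dummy work items; return the peak number in flight."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def work_item(_index):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # stand-in for a blocking I/O request
        with lock:
            state["active"] -= 1

    # The pool size plays the role of max_active in this analogy.
    with ThreadPoolExecutor(max_workers=max_active) as pool:
        list(pool.map(work_item, range(num_items)))
    return state["peak"]


if __name__ == "__main__":
    print("peak concurrency:", run_bounded(64))
```

However many items are queued, the peak never exceeds the cap, which is the property the patch relies on to keep the number of worker threads in check.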

This patch fixes the Fedora 22 live booting performance
regression seen when booting from squashfs over dm
based on loop. The following factors appear to be
related to the problem:

- unlike other filesystems (such as ext4), squashfs
is a bit special: I observed that increasing the number of I/O jobs
accessing a file in squashfs improves I/O performance only a
little, whereas it can make a big difference for ext4

- nested loop: both squashfs.img and ext3fs.img are mounted
as loop block, and ext3fs.img is inside the squashfs

- during booting, lots of tasks may run concurrently

Fixes: b5dd2f6047ca108001328aac0e8588edd15f1778
Cc: stable@vger.kernel.org (v4.0)
Cc: Justin M. Forbes <jforbes@fedoraproject.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

block: loop: convert to per-device workqueue

Documentation/workqueue.txt:
If there is dependency among multiple work items used
during memory reclaim, they should be queued to separate
wq each with WQ_MEM_RECLAIM.

Loop devices can be stacked, so we have to convert to per-device
workqueue. One example is Fedora live CD.

Fixes: b5dd2f6047ca108001328aac0e8588edd15f1778
Cc: stable@vger.kernel.org (v4.0)
Cc: Justin M. Forbes <jforbes@fedoraproject.org>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts:

drivers/block/loop.c
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

megaraid_sas: Permit large RAID0/1 requests

Orabug: 19625877

Allow max_sectors to be tuned to enable I/O request sizes of 1MB. This
is not supported for RAID5/6 volumes but the UEK[23] kernels lack an
infrastructure for communicating per-LUN request size limits. This has
been remedied by upstream commit bcdb247c6b6a.

In the meantime allow setting the max_sectors module parameter to 1MB
for Invader and Fury cards.

Signed-off-by: Kashyap Desai <kashyap.desai@avagotech.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
(cherry picked from commit 54e01aad1ef694d7ec4026d2efb5c8d19f981513)
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

megaraid_sas : Modify return value of megasas_issue_blocked_cmd() and wait_and_poll() to consider command status returned by firmware

This patch is rebased on top of the 18 patches I recently sent for the
megaraid_sas driver.

Change the return value of wait_and_poll() and megasas_issue_blocked_cmd()
based on the MFI_STAT returned by firmware for that command. Previously the
driver always set the return value based on command completion and never
checked the command for MFI_STAT_OK, so even if firmware failed the command,
wait_and_poll() and megasas_issue_blocked_cmd() still returned SUCCESS. If
the caller of these functions does not check the command status (MFI_STAT),
it may end up using invalid data returned in DMA buffers (one example is the
megasas_ld_list_query DCMD). The best way to avoid this type of issue is to
do the error handling in wait_and_poll() and megasas_issue_blocked_cmd() and
return a proper value from them.

The change proposed in this patch fixes the regression introduced in the
patch "90dc9d9 megaraid_sas : MFI MPT linked list corruption fix" inside
megasas_ld_list_query(). Prior to that patch, megasas_ld_list_query() checked
the DCMD status returned by firmware; with that patch applied, the DCMD
status is no longer checked inside megasas_ld_list_query(), which introduced
the issue of megasas_ld_list_query() using wrong data.
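
The difference between reporting completion and reporting the firmware status can be modelled in a few lines. This is a simplified sketch, not driver code: the `Command` class, the two `result_*` helpers, the numeric value of `MFI_STAT_OK` and the failure code are all assumptions made for illustration.

```python
from dataclasses import dataclass

MFI_STAT_OK = 0x00        # numeric value assumed for illustration
SOME_FW_FAILURE = 0x0C    # hypothetical non-OK firmware status


@dataclass
class Command:
    completed: bool   # did the command complete at all?
    cmd_status: int   # status the firmware wrote for the command


def result_by_completion(cmd):
    """Old behaviour as described: only completion decides the result."""
    return "SUCCESS" if cmd.completed else "FAILED"


def result_by_fw_status(cmd):
    """New behaviour as described: the firmware status is checked too."""
    if not cmd.completed:
        return "FAILED"
    return "SUCCESS" if cmd.cmd_status == MFI_STAT_OK else "FAILED"


if __name__ == "__main__":
    failed_by_fw = Command(completed=True, cmd_status=SOME_FW_FAILURE)
    print("completion only:", result_by_completion(failed_by_fw))
    print("with fw status: ", result_by_fw_status(failed_by_fw))
```

A command that completes but is failed by firmware is reported as a success by the first helper and as a failure by the second, so a caller no longer has to inspect the status itself before trusting the DMA buffer.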

Cc: <stable@vger.kernel.org>
Signed-off-by: Kashyap Desai <kashyap.desai@avagotech.com>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : swap whole register in megasas_register_aen

Swap the whole 32 bits we read from the hardware instead of swapping
just the 16 bits we care about in place later.
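
The idea of converting the whole register once and only then picking out fields can be sketched as follows. The field layout used here (a 16-bit locale in the low half and a signed 8-bit class in the top byte of a little-endian word) is an assumption for illustration, and `decode_class_locale` is a hypothetical helper, not a driver function.

```python
import struct


def decode_class_locale(raw):
    """Decode 4 little-endian register bytes into (locale, class).

    The whole 32-bit word is converted to a host integer first;
    the fields are then extracted with plain shifts and masks.
    """
    (word,) = struct.unpack("<I", raw)
    locale = word & 0xFFFF
    class_byte = (word >> 24) & 0xFF
    # Interpret the top byte as a signed 8-bit value.
    event_class = class_byte - 256 if class_byte >= 128 else class_byte
    return locale, event_class


if __name__ == "__main__":
    raw = struct.pack("<HBb", 0x00FF, 0, -1)
    print(decode_class_locale(raw))
```

Because the conversion happens once on the full word, none of the fields is ever left in device byte order, which is the hazard of swapping a single field in place afterwards.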

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : fix megasas_fire_cmd_fusion calling convention

The fusion HBAs don't really use the instance template like the other
variants, as it branches off at a much higher level. So instead of
trying to squeeze megasas_fire_cmd_fusion into the wrong calling
convention call it locally with argument data types that match what
is passed.

[jejb: fix up 32 bit compile failure]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : add missing byte swaps to the sriov code

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : bytewise or should be done on native endian variables

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : move endianness conversion into caller of megasas_get_seq_num

Converting structure fields in place is always a bad idea, and in this case
by moving it into the only caller we also only have to do a single byte
swap as most fields of this structure are never used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : add endianness conversions for all ones

Add noop conversions for all ones to make sparse happy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : add endianness annotations

This adds endianness annotations to all data structures, and a few
variables directly referencing them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : add missing __iomem annotations

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : megasas_complete_outstanding_ioctls() can be static

drivers/scsi/megaraid/megaraid_sas_base.c:1701:6: sparse: symbol 'megasas_complete_outstanding_ioctls' was not declared. Should it be static?
From: Christoph Hellwig <hch@lst.de>

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : Support for Avago's Single server High Availability product

This patch adds support for Single Server High Availability (SSHA)
clusters. Here is a short description of the changes made to add SSHA
support:

1) Host will send system's Unique ID based on DMI_PRODUCT_UUID to firmware.
2) Toggle the devhandle in LDIO path for Remote LDs.

Signed-off-by: Kashyap Desai <kashyap.desai@avagotech.com>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : Add release date and update driver version

This patch upgrades the driver version and adds back the release date and
the sysfs hook for it. Some internal applications use the sysfs parameter
for the release date, so they were broken by the removal of the release
date from sysfs.

Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : Modify driver's meta data to reflect Avago

Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : Use Block layer tag support for internal command indexing

The megaraid_sas driver will use the block layer provided tag to index
internal MPT frames, so that each tag maps to a unique MPT frame. Previously,
each IO request submitted from the SCSI mid layer got its MPT frame from the
MPT frame pool (fetched and returned under a spinlock inside the megaraid_sas
driver's submission/completion callbacks). Getting an MPT frame from the MPT
frame pool is a very expensive operation because of the associated spinlock
(spinlock overhead increases on multi NUMA node systems), considering that
each IO request needs to acquire and release the same lock.

With this support, the IO path directly provides the unique command index
(which is based on the block layer tag) and gets the MPT frame tied to that
tag. This way the driver can get rid of the lock that synchronizes access to
the MPT frame pool while fetching and returning MPT frames.
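
The contrast between a locked free list and tag-based indexing can be sketched like this. It is a user-space model under stated assumptions, not the driver's data structures: `LockedFramePool` and `TagIndexedFrames` are hypothetical classes, and the sketch assumes each tag is unique among in-flight requests.

```python
import threading


class LockedFramePool:
    """Free-list pool: every get/put takes the same lock."""

    def __init__(self, size):
        self._free = list(range(size))
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._free.pop() if self._free else None

    def put(self, frame):
        with self._lock:
            self._free.append(frame)


class TagIndexedFrames:
    """Preallocated frames looked up by tag, with no lock.

    Assumes the caller hands in a tag that no other in-flight
    request holds, so two requests never share a frame.
    """

    def __init__(self, size):
        self._frames = [f"frame-{i}" for i in range(size)]

    def frame_for(self, tag):
        return self._frames[tag]


if __name__ == "__main__":
    pool = LockedFramePool(2)
    frame = pool.get()
    pool.put(frame)
    print(TagIndexedFrames(4).frame_for(2))
```

The lookup in the second class is a plain array index, so the per-request acquire and release of a shared lock disappears from the I/O path.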

This support provides a significant performance improvement (on multi
NUMA node systems) on the latest upstream with SCSI.MQ as well as on existing
Linux distributions.

Here is the data for the test executed at Avago-
- IO Tool- FIO
- 4 Socket SMC server. (4 NUMA node server)
- 12 SSDs in JBOD mode.
- 4K Rand READ, QD=32
- SCSI MQ x86_64 (Latest Upstream kernel)
- up to 300% Performance Improvement.

If IOs are running on a single node, the performance gain is smaller, but as
the number of nodes increases, the performance improvement becomes significant.
With IOs running on all 4 NUMA nodes, 1170K IOPS were observed with this patch
applied vs 344K IOPS without it.

Logically, there are two parts to this patch: 1) block layer tag support and
2) changes in the calling convention of return_cmd. Part 2 reverts the changes
made by patch 90dc9d9 "megaraid_sas : MFI MPT linked list corruption fix",
because the changes in part 1 fix the MFI MPT linked list corruption
problem. Part 2 depends heavily on part 1, so we decided to keep these two
logical changes in a single patch.

[jejb: remove chatty printk pointed out by hch]
Signed-off-by: Kashyap Desai <kashyap.desai@avagotech.com>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : Enhanced few prints

Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : Move controller's queue depth calculation in adapter specific function

Signed-off-by: Kashyap Desai <kashyap.desai@avagotech.com>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : Add separate functions for building sysPD IOs and non RW LDIOs

Signed-off-by: Kashyap Desai <kashyap.desai@avagotech.com>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : Add separate function for refiring MFI commands

This patch adds a separate function for refiring MFI commands in the Fusion
adapters' OCR code.

Signed-off-by: Kashyap Desai <kashyap.desai@avagotech.com>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

megaraid_sas : Add separate function for setting up IRQs

This patch creates separate functions for 1) setting up IRQs for MSI-x
interrupts, 2) setting up IRQs for legacy interrupts, and 3) freeing IRQs, and
enables interrupts after the adapter's initialization. The reason for
initialising the adapter earlier is that by then the firmware is operational
and can send interrupts, so it is better to use the interrupt based interface
to send internal DCMDs to firmware instead of polling. The MFI frame pool size
has been reduced and the polling method does not free MFI frames on fusion
adapters, so sending more DCMDs by polling may exhaust the MFI frame pool and
cause DCMDs to fail.

Signed-off-by: Kashyap Desai <kashyap.desai@avagotech.com>
Signed-off-by: Sumit Saxena <sumit.saxena@avagotech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Odin.com>

bnx2x: update fw to 7.8.2

This new firmware fixes several minor bugs:
1. In switch dependent mode, DCB priority was used to override inner vlan
    priority.
2. In switch dependent mode, inner vlan was added in case of DCB priority
    even if outer vlan was present.
3. In switch dependent mode, outer vlan was overridden by DCB priority when
    working in STATIC COS mode while inner vlan was present.
4. iSCSI - under heavy iSCSI traffic, when TCP out-of-order condition
    occurred, it was possible for the connection to close and recover.
5. iSCSI - connections on-chip TCP establishment might have failed.
6. iSCSI - out-of-order isles might have caused on-chip TCP connections
    to fail in their graceful termination.
7. iSCSI - there was a theoretical race in which an RST packet sent from
    pure-ack queue in specific timing could cause a credit-return overflow.
8. iSCSI - not all packets were completed on a forward channel.
9. DCB - fixed for 4-port devices; until now, wrong credit counters were
    used, causing dcb to fail.
10. Fixed false parity reported in CAM memories when operating near -5% on
    the 1.0V core supply.
11. ETS default settings are set to fairness between traffic classes (rather
    than strict priority), and uses the same chip receive buffer configuration
    for both PFC and pause.

Orabug: 21036509

Cherry-picked a couple of commits from UEK3 and combined them.

Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Goldstein <eilong@broadcom.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

bnx2: Update driver to use new mips firmware.

bnx2-mips-06-6.2.3 and bnx2-mips-09-6.2.1.b

New firmware fixes iSCSI problems with some LeftHand targets that don't
set TTT=0xffffffff for Data-In according to spec. Firmware generates
exception warnings for this condition and becomes very slow. This is
fixed by suppressing these warnings when using default error mask.

Orabug: 21036509

(cherry picked from commit(UEK3) c2c20ef43d00b1439631e603f8dcee9a803cd8b3)
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnx2.c
firmware/Makefile
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

Revert "i40e: Add FW check to disable DCB and wrap autoneg workaround with FW check"

Orabug: 21111674

This commit was included in upstream driver version 1.2.10 however it
caused a regression where the following error is logged when the driver
loads:

i40e 0000:90:00.3: get phy abilities failed, aq_err -7, advertised speed
settings may not be correct

Even the latest upstream driver is affected, so for now revert this
commit until a proper fix is accepted upstream.

This reverts commit 14a9015d3d98cb735fe304204a91ad2a24b979ac.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: Adding the hxge driver

The hxge driver supports the Sun Blade 6000 Virtualized Multi-Fabric 10GbE
M2 Network Express Module (NEM). This NEM provides virtualized 10GbE access
to 10 server modules (blades) in a Sun 6000 Blade Chassis, with each blade
sharing the 10GbE bandwidth. From the perspective of each server module, it
looks like it owns an Oracle 10Gb NIC interface. For more information, see
http://www.oracle.com/us/products/servers-storage/sun-blade-6000-10gbe-m2-nem-ds-080640.pdf

Signed-off-by: James Puthukattukaran <james.puthukattukaran@oracle.com>

fuse: fix typo while displaying fuse numa mount option

The mount command output for a FUSE filesystem incorrectly shows the 'numa'
option run together with the previous option.
This patch adds a comma separator to fix the issue.
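
The symptom is easy to reproduce in a small model of option formatting. This is a sketch only: `format_options` and `format_options_buggy` are hypothetical helpers, and the option list is an example rather than output captured from a real mount.

```python
def format_options(options):
    """Emit every option with a leading comma separator."""
    return "".join("," + opt for opt in options)


def format_options_buggy(options):
    """Same, but the last option is emitted without its separator."""
    *head, last = options
    return "".join("," + opt for opt in head) + last


if __name__ == "__main__":
    opts = ["allow_other", "numa"]
    print(format_options_buggy(opts))
    print(format_options(opts))
```

Without the separator, the last option fuses with the one before it, which matches the behaviour described in the commit message.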

Orabug : 21040004

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>

fuse: add numa mount option

This patch adds numa mount option. When this option is enabled, FUSE groups
all queues and creates one set per numa node. Users of /dev/fuse should listen
on /dev/fuse from all numa nodes.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>

fuse: modify queues, allocation and locking for multiple nodes

This patch makes provision for creating multiple instances of fuse_node
structure, which can be used in cases where a separate fuse_node has to be
created per numa node. It also introduces a new spinlock fn->lock
to synchronize elements within the fuse_node struct thus reducing contention
on fc->lock.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>

fuse: add spinlock to protect fc reqctr

fc->lock protects other members along with the sequence counter. Since the
changes introduced by the next few patches increase parallelism, they increase
contention on fc->lock. Having a new seq_lock spinlock to protect the unique
sequence counter will reduce contention and also make the code simpler.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>

fuse: add fuse node struct

This patch introduces new structure fuse_node, which groups some fields
from fuse_conn structure. In the next few patches, an instance of this
fuse_node struct is created per NUMA node to improve performance.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

ocfs2: Suppress the error message from being printed in ocfs2_rename

This does the same thing as Goldwyn Rodrigues's last patch.

While removing a non-empty directory, the kernel dumps a message:
(mv,29521,1):ocfs2_rename:1474 ERROR: status = -39

Orabug: 16790405
Signed-off-by: Xiaowei.Hu <xiaowei.hu@oracle.com>
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
(cherry picked from commit 92a8dfa5f424bf48ae39e4749c680c0cf5db4fd6)

ocfs2: Tighten free bit calculation in the global bitmap

When clearing bits in the global bitmap, we do not test the current bit value.
This patch tightens the code by considering the possibility that the bit being
cleared was already cleared.

Now this should not happen. But we are seeing stray instances in which free
bit count in the global bitmap exceeds the total bit count. In each instance
the bitmap is correct. Only the free bit count is incorrect.

This patch checks the current bit value and increments the free bit count
only if the bit was previously set. It also prints information to allow
us to debug further.
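
The accounting change can be illustrated with a small list-based bitmap. This is a sketch under simplifying assumptions, not the ocfs2 code: `clear_bits_unchecked` and `clear_bits_checked` are hypothetical helpers, and a Python list of 0/1 values stands in for the on-disk bitmap.

```python
def clear_bits_unchecked(bitmap, free_count, start, count):
    """Clear bits and credit every one of them to the free count."""
    for bit in range(start, start + count):
        bitmap[bit] = 0
    return free_count + count


def clear_bits_checked(bitmap, free_count, start, count):
    """Clear bits, crediting only those that were actually set.

    Returns the new free count and the bits that were already clear,
    which a caller could log for debugging.
    """
    already_clear = []
    for bit in range(start, start + count):
        if bitmap[bit]:
            bitmap[bit] = 0
            free_count += 1
        else:
            already_clear.append(bit)
    return free_count, already_clear


if __name__ == "__main__":
    # Four bits in total, one of them (bit 2) already free.
    print(clear_bits_unchecked([1, 1, 0, 1], 1, 0, 4))
    print(clear_bits_checked([1, 1, 0, 1], 1, 0, 4))
```

With one bit already clear, the unchecked variant pushes the free count past the total number of bits, while the checked variant stops at the total and reports the stray bit.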

Orabug: 17342255

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
(cherry picked from commit d1726d8617f7d27c54d12b50c9a20b248ebd0c66)

ocfs2/trivial: Limit unaligned aio+dio write messages to once per day

It was printing more frequently.

Orabug: 17342255

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
(cherry picked from commit 8a2aa282cf9e003337f6c6949b1c7ed78347d59c)

ocfs2/trivial: Print message indicating unaligned aio+dio write

Print a message indicating unaligned aio+dio writes. It prints a message
once per 24 hrs.
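
Limiting a message to once per 24 hours amounts to remembering when it was last printed. The sketch below shows the idea in user space; `OncePerInterval` is a hypothetical class and is not the mechanism the kernel patch uses.

```python
import time

DAY_SECONDS = 24 * 60 * 60


class OncePerInterval:
    """Allow an action at most once per interval."""

    def __init__(self, interval_seconds=DAY_SECONDS, clock=time.monotonic):
        self._interval = interval_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._last = None

    def allow(self):
        """Return True if the interval has elapsed since the last allow."""
        now = self._clock()
        if self._last is None or now - self._last >= self._interval:
            self._last = now
            return True
        return False


if __name__ == "__main__":
    limiter = OncePerInterval()
    for _ in range(3):
        if limiter.allow():
            print("unaligned aio+dio write detected")
```

Only the first of the three attempts in the demo prints; later ones are suppressed until a full interval has passed.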

Orabug: 17342255

Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
(cherry picked from commit dd146cf4c67bfce3a1fe1495c2f68d149d1c6db0)

Linux 4.1

Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending

Pull scsi target fixes from Nicholas Bellinger:
"Apologies for the late pull request.

  Here are the outstanding target-pending fixes for v4.1 code.

  The series contains three patches from Sagi + Co that address a few
  iser-target issues that have been uncovered during recent testing at
  Mellanox.

  Patch #1 has a v3.16+ stable tag, and #2-3 have v3.10+ stable tags"

* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
  iser-target: Fix possible use-after-free
  iser-target: release stale iser connections
  iser-target: Fix variable-length response error completion

Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux

Pull drm fixes from Dave Airlie:
"A smattering of fixes,

  mgag200:
      don't accept modes that aren't aligned properly as hw can't do it

  i915:
      two regression fixes

  radeon:
      one query to allow userspace fixes
      one oops fixer for older hw with new options enabled"

* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
  drm/radeon: don't probe MST on hw we don't support it on
  drm/radeon: Add RADEON_INFO_VA_UNMAP_WORKING query
  drm/mgag200: Reject non-character-cell-aligned mode widths
  Revert "drm/i915: Don't skip request retirement if the active list is empty"
  drm/i915: Always reset vma->ggtt_view.pages cache on unbinding

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Michael Turquette:
"Very late clk regression fixes for the ARM-based AT91 platform.

  These went unnoticed by me until recently, hence the late pull
  request"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: at91: fix h32mx prototype inclusion in pmc header
  clk: at91: trivial: typo in peripheral clock description
  clk: at91: fix PERIPHERAL_MAX_SHIFT definition
  clk: at91: pll: fix input range validity check

Merge tag 'sound-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"Nothing looks scary, just a few usual HD-audio regression fixes and
  fixup, in addition to a minor Kconfig dependency fix for the old MIPS
  drivers"

* tag 'sound-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda - Fix unused label skip_i915
  ALSA: hda - Fix noisy outputs on Dell XPS13 (2015 model)
  ALSA: mips: let SND_SGI_O2 select SND_PCM
  ALSA: hda - Fix audio crackles on Dell Latitude E7x40
  ALSA: hda - adding a DAC/pin preference map for a HP Envy TS machine

Merge branch 'ccf/atmel-fixes-for-4.1' of https://github.com/bbrezillon/linux-at91 into clk-fixes

clk: at91: fix h32mx prototype inclusion in pmc header

Trivial fix for an issue that prevents this pmc clock driver from compiling
when the h32mx clock is present but the smd clock isn't.

Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Fixes: bcc5fd49a0fd ("clk: at91: add a driver for the h32mx clock")
Cc: <stable@vger.kernel.org> # 3.18+

clk: at91: trivial: typo in peripheral clock description

Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>

clk: at91: fix PERIPHERAL_MAX_SHIFT definition

Fix the PERIPHERAL_MAX_SHIFT definition (3 instead of 4) and adapt the
round_rate and set_rate logic accordingly.
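
A maximum shift of 3 means the parent rate can be divided by 1, 2, 4 or 8. The following sketch shows one simple way a rate-rounding routine could walk those divisors; it is not the driver's actual round_rate logic, and the `round_rate` helper and its selection rule (first rate that does not exceed the request) are assumptions for illustration.

```python
PERIPHERAL_MAX_SHIFT = 3  # value stated in the commit message


def round_rate(parent_rate, requested_rate, max_shift=PERIPHERAL_MAX_SHIFT):
    """Pick the smallest shift whose divided rate fits the request.

    Returns (shift, rate). If even the largest divisor is too fast,
    the slowest available rate is returned.
    """
    for shift in range(max_shift + 1):
        rate = parent_rate >> shift
        if rate <= requested_rate:
            return shift, rate
    return max_shift, parent_rate >> max_shift


if __name__ == "__main__":
    print([1 << shift for shift in range(PERIPHERAL_MAX_SHIFT + 1)])
    print(round_rate(132_000_000, 40_000_000))
```

With the limit set to 4 instead, the loop would also try a divide-by-16 step, which per the commit message is not a valid setting.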

Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Reported-by: "Wu, Songjun" <Songjun.Wu@atmel.com>