We should be doing this, it's weird we hadn't been doing this.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 0d2b2372e097cd3b4150d3ec91e79ac3c5cc750e) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
If we remove a hard link from an inode, the inode gets evicted, then
we fsync the inode and then power fail/crash, when the log tree is
replayed, the parent directory inode still has entries pointing to
the name that no longer exists, while our inode no longer has the
BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
leaving the filesystem in an inconsistent state. The stale directory
entries cannot be deleted (an attempt to delete them causes -ESTALE
errors), which makes it impossible to delete the parent directory.
This happens because we track the id of the transaction where the last
unlink operation for the inode happened (last_unlink_trans) in an
in-memory only field of the inode, that is, a value that is never
persisted in the inode item stored on the fs/subvol btree. So if an
inode is evicted and loaded again, the value for last_unlink_trans is
set to 0, which prevents the fsync from logging the parent directory
at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
to the id of the transaction that last modified the inode when we
load the inode. This is a pessimistic approach, but it always ensures
correctness. The trade-off is the occasional full transaction commit
when an fsync is done against a directory inode in the same transaction
where it was evicted and reloaded, and often logging the parent
unnecessarily when the inode is not a directory.
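A minimal sketch of the resulting change in the inode load path
(assuming the usual btrfs_read_locked_inode()/BTRFS_I() names):

  /*
   * last_unlink_trans is never persisted, so after an eviction we
   * pessimistically assume the last unlink happened in the
   * transaction that last modified the inode.
   */
  BTRFS_I(inode)->last_unlink_trans = BTRFS_I(inode)->last_trans;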
The following test case for fstests triggers the problem:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
# Create our test file with 2 hard links.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/testdir/foo
ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
# Make sure everything done so far is durably persisted.
sync
# Now remove one of the links, trigger inode eviction and then fsync
# our inode.
unlink $SCRATCH_MNT/testdir/bar
echo 2 > /proc/sys/vm/drop_caches
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo
# Silently drop all writes on our scratch device to simulate a power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount the fs to trigger log/journal replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now verify our directory entries.
echo "Entries in testdir:"
ls -1 $SCRATCH_MNT/testdir
# If we remove our inode, its parent should become empty and therefore we should
# be able to remove the parent.
rm -f $SCRATCH_MNT/testdir/*
rmdir $SCRATCH_MNT/testdir
_unmount_flakey
# The fstests framework will call fsck against our filesystem which will verify
# that all metadata is in a consistent state.
status=0
exit
The test failed on btrfs with:
generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
--- tests/generic/098.out 2015-07-23 18:01:12.616175932 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad 2015-07-23 18:04:58.924138308 +0100
@@ -1,3 +1,6 @@
QA output created by 098
Entries in testdir:
+bar
foo
+rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
...
(Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad' to see the entire diff)
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)
$ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
(...)
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
Checking filesystem on /dev/sdc
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit bde6c242027b0f1d697d5333950b3a05761d40e4) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Andrey Ryabinin [Wed, 7 Oct 2015 11:39:55 +0000 (14:39 +0300)]
lockd: get rid of reference-counted NSM RPC clients
Currently we have a reference-counted per-net NSM RPC client,
which is created on the first monitor request and destroyed
after the last unmonitor request. It's needed because the
RPC client needs to know 'utsname()->nodename', but utsname()
might be NULL when nsm_unmonitor() is called.
So instead of holding on to the RPC client we can just save the
nodename in struct nlm_host and pass it to rpc_create().
Thus there is no need to keep the RPC client around until the
last unmonitor request; we can create separate RPC clients
for each monitor/unmonitor request.
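A rough sketch of the idea (field and call shapes assumed, not taken
verbatim from the patch):

  /* At nlm_host creation time, while utsname() is known to be valid. */
  strlcpy(host->nodename, utsname()->nodename, sizeof(host->nodename));

  /* Per MON/UNMON request, a short-lived RPC client is now enough. */
  clnt = nsm_create(host->net, host->nodename);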
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit 0d0f4aab4e4d290138a4ae7f2ef8469e48c9a669)
Commit cb7323fffa85 ("lockd: create and use per-net NSM
RPC clients on MON/UNMON requests") introduced per-net
NSM RPC clients. Unfortunately this doesn't make any sense
without a per-net nsm_handle.
E.g., the following scenario could happen:
Two hosts (X and Y) in different namespaces (A and B) share
the same nsm struct.
1. nsm_monitor(host_X) called => NSM rpc client created,
nsm->sm_monitored bit set.
2. nsm_monitor(host_Y) called => nsm->sm_monitored already set,
we just exit. Thus in namespace B ln->nsm_clnt == NULL.
3. host X destroyed => nsm->sm_count decremented to 1
4. host Y destroyed => nsm_unmonitor() => nsm_mon_unmon() => NULL-ptr
dereference of *ln->nsm_clnt
Fix this by making the nsm_handles list per-net instead of global.
Thus different net namespaces will no longer be able to share the
same nsm_handle.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit 0ad95472bf169a3501991f8f33f5147f792a8116)
Keith Busch [Mon, 24 Aug 2015 13:48:16 +0000 (08:48 -0500)]
PCI: Set MPS to match upstream bridge
Firmware typically configures the PCIe fabric with a consistent Max Payload
Size setting based on the devices present at boot. A hot-added device
typically has the power-on default MPS setting (128 bytes), which may not
match the fabric.
The previous Linux default, in the absence of any "pci=pcie_bus_*" options,
was PCIE_BUS_TUNE_OFF, in which we never touch MPS, even for hot-added
devices.
Add a new default setting, PCIE_BUS_DEFAULT, in which we make sure every
device's MPS setting matches the upstream bridge. This makes it more
likely that a hot-added device will work in a system with optimized MPS
configuration.
Note that if we hot-add a device that only supports 128-byte MPS, it still
likely won't work because we don't reconfigure the rest of the fabric.
Booting with "pci=pcie_bus_peer2peer" is a workaround for this because it
sets MPS to 128 for everything.
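A hedged sketch of what the new default does in the
pci_configure_device() path (pcie_get_mps()/pcie_set_mps() are existing
helpers; the body below is illustrative rather than the exact patch):

  static void pci_configure_mps(struct pci_dev *dev)
  {
      struct pci_dev *bridge = pci_upstream_bridge(dev);
      int mps, p_mps;

      if (!pci_is_pcie(dev) || !bridge || !pci_is_pcie(bridge))
          return;

      mps = pcie_get_mps(dev);
      p_mps = pcie_get_mps(bridge);
      if (mps == p_mps)
          return;

      /* Match the upstream bridge if the device can support its MPS. */
      if ((128 << dev->pcie_mpss) >= p_mps)
          pcie_set_mps(dev, p_mps);
  }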
[bhelgaas: changelog, new default, rework for pci_configure_device() path] Tested-by: Keith Busch <keith.busch@intel.com> Tested-by: Jordan Hargrave <jharg93@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Yinghai Lu <yinghai@kernel.org>
(cherry picked from commit 27d868b5e6cfaee4fec66b388e4085ff94050fa7)
Bjorn Helgaas [Thu, 20 Aug 2015 21:08:27 +0000 (16:08 -0500)]
PCI: Move MPS configuration check to pci_configure_device()
Previously we checked for invalid MPS settings, i.e., a device with MPS
different than its upstream bridge, in pcie_bus_detect_mps(). We only did
this if the arch or hotplug driver called pcie_bus_configure_settings(),
and then only if PCIe bus tuning was disabled (PCIE_BUS_TUNE_OFF).
Move the MPS checking code to pci_configure_device(), so we do it in the
pci_device_add() path for every device.
In the ASN.1 decoder, when the length field of an ASN.1 value is extracted,
it isn't validated against the remaining amount of data before being added
to the cursor. With a sufficiently large size indicated, the check:
datalen - dp < 2
may then fail due to integer overflow.
Fix this by checking the length indicated against the amount of remaining
data in both places a definite length is determined.
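Concretely, the added check is of this shape (a sketch: 'len' is the
decoded length, 'dp' the cursor, 'datalen' the total input size, and
the error label is assumed):

  /* A definite length is only valid if it fits in the remaining data. */
  if (unlikely(len > datalen - dp))
      goto data_overrun_error;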
Whilst we're at it, make the following changes:
(1) Check the maximum size of extended length does not exceed the capacity
of the variable it's being stored in (len) rather than the type that
variable is assumed to be (size_t).
(2) Compare the EOC tag to the symbolic constant ASN1_EOC rather than the
integer 0.
(3) To reduce confusion, move the initialisation of len outside of:
for (len = 0; n > 0; n--) {
since it doesn't have anything to do with the loop counter n.
Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Acked-by: Peter Jones <pjones@redhat.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Backport of upstream commit bd7c5f983f31 ("RDS: TCP: Synchronize accept()
and connect() paths on t_conn_lock.")
An arbitration scheme for duelling SYNs is implemented as part of
commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
involved will arrive at the same arbitration decision. However, this
needs to be synchronized with an outgoing SYN to be generated by
rds_tcp_conn_connect(). This commit achieves the synchronization
through the t_conn_lock mutex in struct rds_tcp_connection.
The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
the t_conn_lock mutex. A SYN is sent out only if the RDS connection is
not already UP (an UP would indicate that rds_tcp_accept_one() has
completed 3WH, so no SYN needs to be generated).
Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
acquiring the t_conn_lock mutex. The only acceptable states (to
allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).
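A sketch of the connect-side check described above (field names follow
the description; socket setup is elided):

  static int rds_tcp_conn_connect(struct rds_connection *conn)
  {
      struct rds_tcp_connection *tc = conn->c_transport_data;
      int ret = 0;

      mutex_lock(&tc->t_conn_lock);
      if (rds_conn_up(conn)) {
          /* rds_tcp_accept_one() completed the 3WH; no SYN needed. */
          mutex_unlock(&tc->t_conn_lock);
          return 0;
      }
      /* ... create the socket and send the outgoing SYN ... */
      mutex_unlock(&tc->t_conn_lock);
      return ret;
  }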
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Backport of upstream commit eb192840266f ("RDS:TCP: Synchronize
rds_tcp_accept_one with rds_send_xmit when resetting t_sock")
There is a race condition between rds_send_xmit -> rds_tcp_xmit
and the code that deals with resolution of duelling syns added
by commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()").
Specifically, we may end up dereferencing a NULL pointer in rds_send_xmit
if we have the interleaving sequence:
rds_tcp_accept_one() vs. rds_send_xmit() [interleaving diagram elided]
The race condition can be avoided without adding the overhead of
additional locking in the xmit path: have rds_tcp_accept_one wait
for rds_tcp_xmit threads to complete before resetting callbacks.
The synchronization can be done in the same manner as rds_conn_shutdown().
First set the rds_conn_state to something other than RDS_CONN_UP
(so that new threads cannot get into rds_tcp_xmit()), then wait for
RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.
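In sketch form, following the rds_conn_shutdown() pattern (the
RESETTING state name is per the description):

  /* Keep new threads out of rds_tcp_xmit()... */
  atomic_set(&conn->c_state, RDS_CONN_RESETTING);
  /* ...and wait for any thread already in it to finish. */
  wait_event(conn->c_waitq, !test_bit(RDS_IN_XMIT, &conn->c_flags));
  /* Now it is safe to reset the tcp callbacks. */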
Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
A pattern of skb usage seen in modules such as RDS-TCP is to
extract `to_copy' bytes from the received TCP segment, starting
at some offset `off' into a new skb `clone'. This is done in
the ->data_ready callback, where the clone skb is queued up for rx on
the PF_RDS socket, while the parent TCP segment is returned unchanged
back to the TCP engine.
The existing code uses the sequence
clone = skb_clone(..);
pskb_pull(clone, off, ..);
pskb_trim(clone, to_copy, ..);
with the intention of discarding the first `off' bytes. However,
skb_clone() + pskb_pull() implies pskb_expand_head(), which ends
up doing a redundant memcpy of bytes that will then get discarded
in __pskb_pull_tail().
To avoid this inefficiency, this commit adds pskb_extract() that
creates the clone, and memcpy's only the relevant header/frag/frag_list
to the start of `clone'. pskb_trim() is then invoked to trim clone
down to the requested to_copy bytes.
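In the rx path the sequence therefore collapses to a single call
(sketch; pskb_extract() takes the parent skb, the offset, the byte
count and a gfp mask):

  /* Before: clone + pull implies a redundant copy of discarded bytes. */
  clone = skb_clone(skb, GFP_ATOMIC);
  if (clone) {
      pskb_pull(clone, offset);
      pskb_trim(clone, to_copy);
  }

  /* After: copy only the bytes that are actually kept. */
  clone = pskb_extract(skb, offset, to_copy, GFP_ATOMIC);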
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvement when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 05cc5a39ddb7 ("bnx2x: add vlan filtering offload") introduced
a regression with regard to VLANs for 57710 and 57711 adapters -
loading the 8021q module on a machine with such an adapter would cause
a NULL pointer dereference, as the driver mistakenly publishes that it
has capabilities for VLAN CTAG filtering.
Reported-by: Otto Sabart <osabart@redhat.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ab6d7846cf80affc43b9d412fed5e25dfcf4f35d) Signed-off-by: Brian Maly <brian.maly@oracle.com>
IPoIB collects statistics of traffic including number of packets
sent/received, number of bytes transferred, and certain errors. This
patch makes these statistics available to be queried by ethtool.
IPoIB puts skb fragments in SGEs, adding 1 extra SGE when SG is enabled.
The current codepath assumes that the max number of SGEs a device supports
is at least MAX_SKB_FRAGS+1; there is no interaction with upper layers
to limit the number of fragments in an skb if a device supports fewer SGEs.
The assumptions also lead to requesting a fixed number of SGEs when
IPoIB creates queue pairs with SG enabled.
A fallback/slowpath is implemented using skb_linearize to
handle cases where the conversion would result in more SGEs than supported.
Change-Id: Ia81e69d7231987208ac298300fc5b9734f193a2d Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Junxiao Bi [Mon, 21 Sep 2015 07:54:06 +0000 (15:54 +0800)]
ocfs2: o2hb: don't negotiate if last hb fail
Sometimes an I/O error is returned when storage has been down for a while.
For an iSCSI device, for example, the storage is taken offline when the
session times out, and this makes all I/O return -EIO. In this case nodes
shouldn't negotiate the timeout but should fence themselves. So let a node
fence itself when o2hb_do_disk_heartbeat() returns an error; this is the
same behavior as o2hb without the negotiate timer.
Junxiao Bi [Fri, 18 Sep 2015 05:57:33 +0000 (13:57 +0800)]
ocfs2: o2hb: add NEGOTIATE_APPROVE message
This message is used to re-queue the write timeout timer and the negotiate
timer when all nodes suffer hung writes to storage; this keeps a node from
fencing itself if the storage is down.
Junxiao Bi [Fri, 18 Sep 2015 05:47:39 +0000 (13:47 +0800)]
ocfs2: o2hb: add NEGO_TIMEOUT message
This message is sent to the master node when a non-master node's
negotiate timer expires. The master node records these nodes in
a bitmap which is used to make the write timeout timer re-queue
decision.
Junxiao Bi [Fri, 18 Sep 2015 07:15:31 +0000 (15:15 +0800)]
ocfs2: o2hb: add negotiate timer
When storage goes down, all nodes will fence themselves due to write
timeouts. The negotiate timer is designed to avoid this; with it, nodes
will wait until the storage comes back up.
The negotiate timer works in the following way:
1. The timer expires before the write timeout timer; its timeout is half
of the write timeout. It is re-queued along with the write timeout timer.
If it expires, it sends a NEGO_TIMEOUT message to the master node (the
node with the lowest node number). This message does nothing but mark a
bit in a bitmap recording which nodes are negotiating a timeout on the
master node.
2. If storage goes down, all nodes will send this message to the master
node; when the master node finds that its bitmap includes all online
nodes, it sends a NEGOTIATE_APPROVE message to all nodes one by one.
This message re-queues the write timeout timer and the negotiate timer.
Any node that doesn't receive this message, or meets some issue when
handling it, will be fenced.
If the storage comes back up at any time, o2hb_thread will run and
re-queue all the timers, and nothing is affected by these two steps.
Peter Hurley [Mon, 11 Jan 2016 06:40:55 +0000 (22:40 -0800)]
tty: Fix unsafe ldisc reference via ioctl(TIOCGETD)
ioctl(TIOCGETD) retrieves the line discipline id directly from the
ldisc because the line discipline id (c_line) in termios is untrustworthy;
userspace may have set termios via ioctl(TCSETS*) without actually
changing the line discipline via ioctl(TIOCSETD).
However, directly accessing the current ldisc via tty->ldisc is
unsafe; the ldisc ptr dereferenced may be stale if the line discipline
is changing via ioctl(TIOCSETD) or hangup.
Wait for the line discipline reference (just like read() or write())
to retrieve the "current" line discipline id.
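The resulting handler is essentially this (sketch of the upstream
approach; tty_ldisc_ref_wait() blocks while the ldisc is changing):

  static int tiocgetd(struct tty_struct *tty, int __user *p)
  {
      struct tty_ldisc *ld;
      int ret;

      ld = tty_ldisc_ref_wait(tty);  /* waits out TIOCSETD/hangup */
      ret = put_user(ld->ops->num, p);
      tty_ldisc_deref(ld);
      return ret;
  }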
Cc: <stable@vger.kernel.org> Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5c17c861a357e9458001f021a7afa7aab9937439)
Alan Stern [Wed, 16 Dec 2015 18:32:38 +0000 (13:32 -0500)]
USB: fix invalid memory access in hub_activate()
Commit 8520f38099cc ("USB: change hub initialization sleeps to
delayed_work") changed the hub_activate() routine to make part of it
run in a workqueue. However, the commit failed to take a reference to
the usb_hub structure or to lock the hub interface while doing so. As
a result, if a hub is plugged in and quickly unplugged before the work
routine can run, the routine will try to access memory that has been
deallocated. Or, if the hub is unplugged while the routine is
running, the memory may be deallocated while it is in active use.
This patch fixes the problem by taking a reference to the usb_hub at
the start of hub_activate() and releasing it at the end (when the work
is finished), and by locking the hub interface while the work routine
is running. It also adds a check at the start of the routine to see
if the hub has already been disconnected, in which case nothing should
be done.
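A sketch of the pattern (the kref and hub_release() exist in the hub
driver; the exact locking call used by the patch may differ):

  kref_get(&hub->kref);            /* pin the hub across the work */
  device_lock(hub->intfdev);       /* serialize with disconnect */
  if (hub->disconnected)
      goto done;
  /* ... bring up the ports ... */
  done:
  device_unlock(hub->intfdev);
  kref_put(&hub->kref, hub_release);  /* drop when the work is done */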
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Alexandru Cornea <alexandru.cornea@intel.com> Tested-by: Alexandru Cornea <alexandru.cornea@intel.com> Fixes: 8520f38099cc ("USB: change hub initialization sleeps to delayed_work") CC: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e50293ef9775c5f1cf3fcc093037dd6a8c5684ea)
Commit 8b13eddfdf04cbfa561725cfc42d6868fe896f56 ("netfilter: refactor NAT
redirect IPv4 to use it from nf_tables") has introduced a trivial logic
change which can result in the following crash.
unsigned int
nf_nat_redirect_ipv4(struct sk_buff *skb,
...
{
...
rcu_read_lock();
indev = __in_dev_get_rcu(skb->dev);
if (indev != NULL) {
ifa = indev->ifa_list;
newdst = ifa->ifa_local; <---
}
rcu_read_unlock();
...
}
Before the commit, 'ifa' had always been checked before access. After the
commit, however, it could be accessed even if it's NULL. Interestingly,
this was once fixed in 2003.
In addition to the original one, we have seen the crash when packets that
need to be redirected somehow arrive on an interface which hasn't yet
been fully configured.
This change just reverts the logic to the old behavior to avoid the crash.
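The restored logic, in sketch form (matching the snippet quoted above):

  newdst = 0;

  rcu_read_lock();
  indev = __in_dev_get_rcu(skb->dev);
  if (indev != NULL) {
      ifa = indev->ifa_list;
      if (ifa != NULL)  /* NULL on a not-yet-configured interface */
          newdst = ifa->ifa_local;
  }
  rcu_read_unlock();

  if (!newdst)
      return NF_DROP;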
Fixes: 8b13eddfdf04 ("netfilter: refactor NAT redirect IPv4 to use it from nf_tables") Signed-off-by: Munehisa Kamata <kamatam@amazon.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 94f9cd81436c85d8c3a318ba92e236ede73752fc)
Andy Lutomirski [Wed, 6 Jan 2016 20:21:01 +0000 (12:21 -0800)]
x86/mm: Add barriers and document switch_mm()-vs-flush synchronization
When switch_mm() activates a new PGD, it also sets a bit that
tells other CPUs that the PGD is in use so that TLB flush IPIs
will be sent. In order for that to work correctly, the bit
needs to be visible prior to loading the PGD and therefore
starting to fill the local TLB.
Document all the barriers that make this work correctly and add
a couple that were missing.
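The core ordering requirement, sketched (on x86 the CR3 write is
architecturally serializing, which is what provides the barrier here):

  /* Publish that this CPU uses the mm, so flush IPIs reach us... */
  cpumask_set_cpu(cpu, mm_cpumask(next));
  /*
   * ...and only then start filling the TLB from the new page tables.
   * The store above must be visible before the first TLB fill, so a
   * serializing operation (the CR3 write) must sit in between.
   */
  load_cr3(next->pgd);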
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 71b3c126e61177eb693423f2e18a1914205b165e)
From Avinash Repaka <avinash.repaka@oracle.com>:
This patch reverts the fix for Orabug: 22661521, since the fix assumes
that the memory region is always aligned on a page boundary, causing an
EMSGSIZE error when trying to register a 1MB region that isn't 4KB aligned.
These issues were observed on the 4.1.12-39.el6uek kernel tag.
Signed-off-by: Avinash Repaka <avinash.repaka@oracle.com> Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
A case can occur when sctp_accept() is called by the user during
a heartbeat timeout event after the 4-way handshake. Since
sctp_assoc_migrate() changes both assoc->base.sk and assoc->ep, the
bh_sock_lock in sctp_generate_heartbeat_event() will be taken with
the listening socket but released with the new association socket.
The result is a deadlock on any future attempts to take the listening
socket lock.
Note that this race can occur with other SCTP timeouts that take
the bh_lock_sock() in the event sctp_accept() is called.
=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
CslRx/12087 is trying to release lock (slock-AF_INET) at:
[<ffffffffa01bcae0>] sctp_generate_timeout_event+0x40/0xe0 [sctp]
but there are no more locks to release!
other info that might help us debug this:
2 locks held by CslRx/12087:
#0: (&asoc->timers[i]){+.-...}, at: [<ffffffff8108ce1f>] run_timer_softirq+0x16f/0x3e0
#1: (slock-AF_INET){+.-...}, at: [<ffffffffa01bcac3>] sctp_generate_timeout_event+0x23/0xe0 [sctp]
Ensure the socket taken is also the same one that is released by
saving a copy of the socket before entering the timeout event
critical section.
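In sketch form, for the heartbeat timer:

  struct sctp_association *asoc = transport->asoc;
  /* Snapshot the socket: sctp_assoc_migrate() may change
   * asoc->base.sk meanwhile, and lock/unlock must use the same sk. */
  struct sock *sk = asoc->base.sk;

  bh_lock_sock(sk);
  /* ... timeout processing ... */
  bh_unlock_sock(sk);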
Signed-off-by: Karl Heiss <kheiss@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 635682a14427d241bab7bbdeebb48a7d7b91638e) Signed-off-by: Brian Maly <brian.maly@oracle.com> Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
net/sctp/sm_sideeffect.c
chris hyser [Fri, 1 Apr 2016 19:44:22 +0000 (12:44 -0700)]
sparc64: Fix I/O NUMA parsing and sysfs display code.
I/O NUMA node parsing has been broken since T5 and did not work on
T7. The code also did not correctly handle PCIe root complexes
crossbar-connected to multiple memory/cpu NUMA nodes. Additionally,
the numa_node attributes displayed in sysfs were incorrect.
Example: T7-4 showing round-robin spread of multiply connected root
complexes.
[ 3723.288247] /pci@305: On NUMA node 0
[ 3723.363398] /pci@304: On NUMA node 2
[ 3723.437486] /pci@307: On NUMA node 0
[ 3723.510510] /pci@306: On NUMA node 2
[ 3723.582582] /pci@313: On NUMA node 0
[ 3723.655276] /pci@308: On NUMA node 2
[ 3723.728077] /pci@302: On NUMA node 0
[ 3723.800774] /pci@30a: On NUMA node 2
[ 3723.874895] /pci@309: On NUMA node 0
[ 3723.947089] /pci@301: On NUMA node 2
[ 3724.020218] /pci@30b: On NUMA node 1
[ 3724.092902] /pci@300: On NUMA node 3
[ 3724.167630] /pci@303: On NUMA node 1
[ 3724.240287] /pci@30c: On NUMA node 3
[ 3724.312245] /pci@312: On NUMA node 1
[ 3724.384857] /pci@30e: On NUMA node 3
[ 3724.457482] /pci@30d: On NUMA node 1
[ 3724.531679] /pci@310: On NUMA node 3
[ 3724.603621] /pci@30f: On NUMA node 1
[ 3724.675695] /pci@311: On NUMA node 3
chris hyser [Thu, 7 Apr 2016 20:55:56 +0000 (13:55 -0700)]
sparc64: Set up core sibling list correctly for T7.
The important definition of core sibling is that some level of cache is shared.
The prior SPARC notion of socket was defined as the highest level of shared cache.
On T7 platforms, the MD record now describes the CPUs that share the physical
socket and this is no longer tied to shared cache. This patch correctly
separates these two concepts.
chris hyser [Thu, 7 Apr 2016 19:12:05 +0000 (12:12 -0700)]
sparc64: Fix CPU package information in /sys
CPU package information in
/sys/bus/cpu/devices/cpu*/topology/physical_package_id
is inconsistent with its use by tools such as irqbalance. This patch
uses the socket ID to be consistent and useful.
chris hyser [Thu, 7 Apr 2016 19:32:48 +0000 (12:32 -0700)]
sparc64: Add 3rd level cache info to /sys
This patch pulls line size and cache size info from the machine description
and adds L3 cache files to the /sys/bus/cpu/devices/cpu* directories. It also
structures the information in the same directory hierarchy as x86 so that user
programs like irqbalance can find the needed information to work correctly.
Rob Gardner [Sun, 27 Mar 2016 22:39:13 +0000 (16:39 -0600)]
sparc64: Add lightweight syscall mechanism for lwp_info
This patch introduces a new "light weight" system call
mechanism which has the ability to retrieve small bits
of information and/or perform minor computations without
the need for a full blown save/switch/restore context.
Solaris provides _lwp_info(), which returns basically the
same information as getrusage(RUSAGE_THREAD) but much faster.
This is used extensively by the database code, and returns
the utime and stime for the calling thread.
(This patch also provides a fast getcpu function just as
a demonstration of how additional calls might be added.
Unlike x86, there is no unprivileged instruction to do this,
and so it is a fairly expensive system call.)
Allen Pais [Tue, 29 Mar 2016 08:50:33 +0000 (14:20 +0530)]
sparc64: piggyback program generates a.out header with incorrect section sizes
piggyback in UEK for SPARC generates an a.out that has section sizes that are
too large. This causes problems when booting with OpenBoot, because OpenBoot
uses those sizes to map and copy the image to its specified VA and runs into
unmapped memory during the copies.
Signed-off-by: Jose Marchesi <jose.marchesi@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit bd99ee7ceffb1a472ccd8841dd7011d15e7fa258)
wim.coekaerts@oracle.com [Fri, 29 Jan 2016 17:39:38 +0000 (09:39 -0800)]
Add sun4v_wdt watchdog driver
This driver adds sparc hypervisor watchdog support. The default
timeout is 60 seconds and the range is between 1 and 31536000 seconds.
Both the watchdog-resolution and watchdog-max-timeout MD properties
are supported.
Signed-off-by: Wim Coekaerts <wim.coekaerts@oracle.com> Reviewed-by: Julian Calaby <julian.calaby@gmail.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit eccc96426978c0fa963f8712077ecb6247f0e57e)
The SR-IOV code looks for arch-specific data while enabling
VFs. When a VF device is added, the driver probe function makes a set
of calls to initialize the PCI device. Because the VF device is
added in a different way than the normal PF device (which happens via
of_create_pci_dev for sparc), some of the arch-specific initialization
does not happen for the VF device. That causes a panic when archdata is
accessed.
To fix this, I have used the already-defined weak function
pcibios_setup_device to copy archdata from the PF to the VF.
The fix has also been verified.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit be81c7e3cc48d3ff8b26021be3fd49e997743cbc)
chris hyser [Thu, 4 Feb 2016 21:14:43 +0000 (13:14 -0800)]
sparc64: enable "relaxed ordering" in IOMMU mappings
Enable relaxed ordering for memory writes in IOMMU TSB entry from
dma_4v_map_page() and dma_4v_map_sg() when dma_attrs
DMA_ATTR_WEAK_ORDERING is set. This requires vPCI version 2.0 API.
chris hyser [Thu, 4 Feb 2016 20:07:03 +0000 (12:07 -0800)]
sparc64: Enable PCI IOMMU version 2 API
Enable Version 2 of the PCI IOMMU API needed for advanced features
such as PCI Relaxed Ordering and greater than 2 GB DMA address
space per root complex.
Sowmini Varadhan [Tue, 2 Feb 2016 18:41:56 +0000 (10:41 -0800)]
sunvnet: perf tracepoint invocations to trace LDC state machine
Use sunvnet perf trace macros to monitor LDC message exchange state.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5fa4282fdb6d30937abcf1b1a9d367aaf472178a)
Sowmini Varadhan [Tue, 2 Feb 2016 18:41:55 +0000 (10:41 -0800)]
sunvnet: Add support for perf LDC event tracing
Add perf event macros to support tracing and instrumentation
of the LDC state machine.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 61cf74d322a9d8ef172251e32c3008cf60964b70)
- Disable cpu timer only for hot-remove and not for hot-add
- Update interrupt affinities before interrupt redistribution
- Default to simple round-robin interrupt redistribution for ldoms
Tushar Dave [Thu, 7 Jan 2016 23:24:26 +0000 (15:24 -0800)]
sparc64: bypass iommu to use 64bit address space
This patch is internal only, not for UPSTREAM. It is a temporary
workaround based on UEK2 commit c1a12ed1d125
("sparc64: enable iommu bypass workaround for IB. This is temporary.")
The current design of the sparc IOMMU is based on the IOMMU V1 APIs,
which at most can provide 2G/8K DMA addresses. Due to this, a kernel
entity (e.g. i40e, PSIF) requesting more than 2G/8K DMA addresses does
not work at all.
This patch adds a temporary workaround to remedy the issue by bypassing
the IOMMU.
When the 64-bit IOMMU implementation is complete, this workaround will
be reverted.
RDS_TCP_DEFAULT_BUFSIZE has been unused since commit 1edd6a14d24f
("RDS-TCP: Do not bloat sndbuf/rcvbuf in rds_tcp_tune").
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Upstream commit c6a58ffed536 ("RDS: TCP: Add sysctl tunables for
sndbuf/rcvbuf on rds-tcp socket")
Add per-net sysctl tunables to set the size of sndbuf and
rcvbuf on the kernel tcp socket.
The tunables are added at /proc/sys/net/rds/tcp/rds_tcp_sndbuf
and /proc/sys/net/rds/tcp/rds_tcp_rcvbuf.
These values must be set before accept() or connect(),
and there may be an arbitrary number of existing rds-tcp
sockets when the tunable is modified. To make sure that all
connections in the netns pick up the same value for the tunable,
we reset existing rds-tcp connections in the netns, so that
they can reconnect with the new parameters.
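Applying the tunables when the kernel socket is tuned looks roughly
like this (tunable names shortened; socket locking elided):

  if (rds_tcp_sndbuf > 0) {
      sk->sk_sndbuf = rds_tcp_sndbuf;
      sk->sk_userlocks |= SOCK_SNDBUF_LOCK;  /* keep autotuning off */
  }
  if (rds_tcp_rcvbuf > 0) {
      sk->sk_rcvbuf = rds_tcp_rcvbuf;
      sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
  }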
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Wei Lin Guay [Wed, 27 Jan 2016 12:18:08 +0000 (13:18 +0100)]
RDS: add flow control info to rds_info_rdma_connection
Added per-connection flow_ctl_post_credit and
flow_ctl_send_credit to rds-info. These fields help
in debugging RDS IB flow control. The newly added
attributes are placed at the bottom of the data
structure to ensure backward compatibility.
Wei Lin Guay [Thu, 17 Dec 2015 08:34:33 +0000 (09:34 +0100)]
RDS: update IB flow control algorithm
The current algorithm, which uses 16 as a hard-coded value
in rds_ib_advertise_credits(), doesn't serve the purpose, as
post_recvs() are performed in bulk. Thus, the test
condition will always be true.
This patch moves rds_ib_advertise_credits() into the
post_recvs() loop. Instead of updating the post_recv credits
after all the post_recvs() have completed, the post_recv
credit is updated in a log2-incremental manner.
The proposed exponential quadrupling algorithm serves as a
good compromise between early start of the peer and at the
same time reducing the amount of explicit ACKs. The credit
update explicit ACKs will be generated starting from 16,
256, 4096...etc.
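A sketch of those update points inside the refill loop (threshold
bookkeeping assumed, not verbatim):

  /* Advertise at exponentially spaced counts: 16, 256, 4096, ... */
  if (posted == thresh) {
      rds_ib_advertise_credits(conn, posted - advertised);
      advertised = posted;
      thresh <<= 4;
  }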
The performance numbers below show that this new flow
control algorithm has minimal impact on performance, even though
it requires additional explicit ACKs.
IB flow control is always disabled regardless of the
rds_ib_sysctl_flow_control flag.
The issue is that the initial credit advertisement
announces zero credits, because ib_recv_refill() has
not yet been called. An initial credit offering
of zero effectively disables flow control.
IB flow control is only enabled if both the active and
the passive connection have set the rds_ib_sysctl_flow_control
flag. E.g.:
Conn. A (on), Conn. B (on) = enable
Conn. A (off), Conn. B (on) = disable
Conn. A (on), Conn. B (off) = disable
Conn. A (off), Conn. B (off) = disable
Roger Pau Monné [Tue, 3 Nov 2015 16:40:43 +0000 (16:40 +0000)]
xen-blkback: read from indirect descriptors only once
Since indirect descriptors are in memory shared with the frontend, the
frontend could alter the first_sect and last_sect values after they have
been validated but before they are recorded in the request. This may
result in I/O requests that overflow the foreign page, possibly
overwriting local pages when the I/O request is executed.
When parsing indirect descriptors, only read first_sect and last_sect
once.
This is part of XSA155.
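In sketch form (READ_ONCE() keeps the compiler from re-reading the
shared ring; macro names are per that kernel era):

  first_sect = READ_ONCE(seg->first_sect);
  last_sect  = READ_ONCE(seg->last_sect);
  /* Validate the local copies; the frontend cannot change them now. */
  if (last_sect >= (XEN_PAGE_SIZE >> 9) || last_sect < first_sect)
      goto fail;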
(cherry-pick from 18779149101c0dd43ded43669ae2a92d21b6f9cb) CC: stable@vger.kernel.org Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
mlx4_core: scale_profile should work without params set to 0
The "scale_profile" parameter is to be used to do scaling of
hca params without requiring specific tuning setting for each
one of them individually and yet allowing manual setting of
variables.
In UEK2, the module params were default zero and initialized
later from a "driver default" or a scale_profile dictated value.
In UEK4 (derived from Mellanox OFED 2.4) module params were
pre-initialized to default values and are not zero and have
to be forced to 0 for dynamic scaling to be activated.
This defeats the purpose of having a single parameter to achieve
scaling and not requiring setting individual parameters (while
retaining ability to revert to driver defaults).
The changes here (re)introduce a separate static instance
containing default parameters separate from module parameters
which are pre-initialized to zero for parameters that can scale
dynamically and to default values for others. The zero module
parameters are later initialized to either a scale_profile
governed value or driver defaults.
When RAC tries to scale RDS-TCP, they hit bottlenecks
due to inefficiencies in rds_bind_lookup. Each call to
rds_bind_lookup results in an irqsave/irqrestore sequence, and
when the list of RDS sockets is large, we end up having IRQs
suppressed for long intervals. This triggers flow-control assertions
and causes TX queue watchdog hangs in the sender. The current
implementation makes this even worse by superfluously calling
rds_bind_lookup(). This patch set takes the first step to solving
this problem by avoiding one of the redundant calls to rds_bind_lookup.
When errors such as connection hangs or failures are encountered
over RDS-TCP, the sending RDS, in an attempt at HA, will try to
reconnect, and trip up on all sorts of data structures intended
for ToS support. The ToS feature is currently only supported for
RDS-IB, and unplanned/untested usage of these data
structures by RDS-TCP causes deadlocks and panics.
Until we properly design, support, and test the ToS feature for
RDS-TCP, such paths should not be wandered into. Thus this patchset
adds defensive checks to ignore rs_tos settings in rds_sendmsg() for
TCP transports, and prevents the sending of ToS heartbeat pings
in rds_send_hb() for TCP transport.
For reference, the deadlock that can be encountered in the
hb ping path is:
shamir rabinovitch [Wed, 16 Mar 2016 13:57:19 +0000 (09:57 -0400)]
rds: rds-stress show all zeros after few minutes
The issue can be seen on platforms that use a page size of 8K or above
while the RDS fragment size is 4K. On those platforms a single page is
shared between 2 or more RDS fragments. Each fragment has its own
offset, and the RDS cong map code needs to take this offset into
account. Not doing so leads to reading a data fragment as a congestion
map fragment, and a hang of the RDS transmit due to far cong map
corruption.
Two different threads with different RDS sockets may be in
rds_recv_rcvbuf_delta() via the receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
become incorrect. Let's use atomic ops to avoid such an issue.
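In sketch form (the congestion map is updated through little-endian
bitmap helpers; the atomic variant is the point here):

  /* Non-atomic __set_bit_le() can lose a concurrent update to the
   * same word; the atomic op makes the read-modify-write safe. */
  set_bit_le(off, (void *)map->m_page_addrs[i]);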
Full credit to Wengang <wen.gang.wang@oracle.com> for
finding the issue, analysing it, and also pointing out
the offending code, along with a spinlock-based fix.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
The rds_fmr_flush workqueue calls ib_unmap_fmr
to invalidate a list of FMRs. Today, this workqueue
can be scheduled on any CPU. In a NUMA-aware system,
scheduling this workqueue to run on a CPU core closer to the
ib_device can improve performance. For now, we use the
"sequential-low" policy. This policy selects the two lower
CPU cores closer to the HCA. In a non-NUMA-aware system,
scheduling the rds_fmr_flush workqueue on a fixed CPU core
improves performance.
The mapping of a CPU to the rds_fmr_flush workqueue
can be enabled/disabled via sysctl and is enabled
by default. To disable the feature, use the sysctl below:
rds_ib_sysctl_disable_unmap_fmr_cpu = 1
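A sketch of how the pinning can look (field names hypothetical;
queue_delayed_work_on() is the stock API for running work on a chosen
CPU):

  if (!rds_ib_sysctl_disable_unmap_fmr_cpu)
      queue_delayed_work_on(pool->flush_cpu, rds_ib_fmr_wq,
                            &pool->flush_worker, 10);
  else
      queue_delayed_work(rds_ib_fmr_wq, &pool->flush_worker, 10);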
Putting down some of the rds-stress performance numbers comparing
the default and the sequential-low policy in a NUMA
system with Oracle M4 QDR and Mellanox CX3:
rds-stress 4 conns, 32 threads, 16 depths, RDMA write
and unidirectional (higher is better).
Santosh Shilimkar [Wed, 21 Oct 2015 23:47:28 +0000 (16:47 -0700)]
RDS: IB: support larger frag size up to 16KB
The Infiniband (IB) transport supports a larger message size
than RDS_FRAG_SIZE, which is usually the 4KB PAGE_SIZE.
Nevertheless, RDS always fragments each payload into
RDS_FRAG_SIZE chunks before handing it over to the underlying
IB HCA.
One important message size required for the database
is 8448 (8K + 256B control message) for BCOPY. This RDS
message, even with the IB transport, will generate three
IB work requests (WR), each having its own RDS header.
This series of patches improves RDS performance by allowing the
IB transport to send/receive an RDS message with a larger
RDS_FRAG_SIZE (ideally, using a single WR).
In order to maintain backward compatibility and
interoperability between various RDS versions, and at
the same time to support various FRAG_SIZEs, the IB
fragment size is negotiated per connection.
Although IB is capable of supporting a 4GB message size,
currently we limit the IB RDS_FRAG_SIZE to 16KB for
two reasons:
1. This is needed for the current 8448 RDS message size usecase.
2. Minimizing the size of each receive queue entry gives
optimal memory usage.
In terms of implementation, the 'dp_reserved2' field of
'struct rds_ib_connect_private' now carries information about
the supported IB fragment size. Since we are just
using the IB connection private data and a reserved field,
the protocol version is not bumped up. Furthermore, the feature
is enabled only for RDS_PROTOCOL_v4.1 and above (future).
To keep things simpler for the user, a module parameter
'rds_ib_max_frag' is provided. Without the module parameter,
the default PAGE_SIZE frag will be used. During connection
establishment, the smallest fragment size will be
chosen. If the fragment size is 0, it means the RDS module
doesn't support large fragment sizes and the default
RDS_FRAG_SIZE will be used.
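A sketch of the negotiation described above (encoding assumed; only
the reserved private-data word is used):

  /* Advertise our supported frag size in the connect private data. */
  dp->dp_reserved2 = cpu_to_be32(rds_ib_max_frag);

  /* On completion, use the smaller of the two sides' values. */
  ic->i_frag_sz = min_t(u32, rds_ib_max_frag,
                        be32_to_cpu(dp->dp_reserved2));
  if (!ic->i_frag_sz)
      ic->i_frag_sz = RDS_FRAG_SIZE;  /* peer predates the feature */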
Up to ~10+% improvement seen with Orion and ~9+% with RDBMS
update queries.
Santosh Shilimkar [Mon, 21 Mar 2016 06:24:32 +0000 (23:24 -0700)]
RDS: IB: purge receive frag cache on connection shutdown
RDS IB connections can be formed with different fragment sizes across
reconnects, and hence the current frag cache needs to be purged to
avoid stale frag usage.
Santosh Shilimkar [Wed, 4 Nov 2015 21:42:39 +0000 (13:42 -0800)]
RDS: IB: scale rds_ib_allocation based on fragment size
The 'rds_ib_sysctl_max_recv_allocation' allocation is used to manage
and allocate the size of the IB receive queue entry (RQE) for each IB
connection. However, it relies on the hardcoded RDS_FRAG_SIZE.
Let's make it scale based on the supported fragment sizes of different
IB connections. Each connection can allocate a different RQE size
depending on the per-connection fragment_size.
Santosh Shilimkar [Wed, 21 Oct 2015 23:47:28 +0000 (16:47 -0700)]
RDS: IB: make fragment size (RDS_FRAG_SIZE) dynamic
The IB fabric is capable of fragmenting 4GB of data payload into
send_first, send_middle and send_last. Nevertheless,
RDS fragments each payload into PAGE_SIZE, which is usually
4KB. This patch makes RDS_FRAG_SIZE for the RDS IB transport
dynamic.
In preparation for subsequent patch(es), this patch
adds per-connection peer negotiation to determine the
supported fragment size for the IB transport.
Orabug: 21894138 Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com> Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Wei Lin Guay [Fri, 20 Nov 2015 23:14:48 +0000 (15:14 -0800)]
RDS: fix the sg allocation based on actual message size
Fix an issue where only PAGE_SIZE bytes are allocated per
scatter-gather entry (SGE) regardless of the actual message
size. Furthermore, use the buddy memory allocation technique to
allocate/free memory (if possible) to reduce the number of SGEs.
Santosh Shilimkar [Mon, 16 Nov 2015 21:28:11 +0000 (13:28 -0800)]
RDS: make congestion code independent of PAGE_SIZE
The RDS congestion map code is designed with a base assumption of
a 4K page size. The map update as well as the transport code assume
it that way. Of course it breaks when a transport like IB starts
supporting fragments larger than 4K.
To overcome this limitation without too many changes to the core
congestion map update logic, define an independent RDS_CONG_PAGE_SIZE
and use it.
While at it, we also move rds_message_map_pages(), whose sole
purpose is to map congestion pages, to the congestion code.
Santosh Shilimkar [Tue, 16 Feb 2016 18:24:31 +0000 (10:24 -0800)]
RDS: Back out OoO send status fix since it causes the regression
With the DB build, a crash was observed which was boiled down to
this change on UEK2. Proactively we back this out on UEK4 as well
till the issue gets addressed.
net/mlx4_core: Modify default value of log_rdmarc_per_qp to be consistent with HW capability
This value is used to calculate max_qp_dest_rdma.
The default value of 4 brings us to 16, while the HW supports 128
(max_requester_per_qp).
Although this value can be changed via a module param, it is best
that the default be optimized.
Roman Gushchin [Mon, 12 Oct 2015 13:33:44 +0000 (16:33 +0300)]
fuse: break infinite loop in fuse_fill_write_pages()
I got a report about an unkillable task eating CPU. Further
investigation showed that the problem is in the fuse_fill_write_pages()
function. If the iov's first segment has zero length, we get an infinite
loop, because we never reach the iov_iter_advance() call.
Fix this by calling iov_iter_advance() before repeating an attempt to
copy data from userspace.
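The fix, in sketch form (the advance is hoisted above the retry check
so that a zero-byte copy still consumes the empty segment):

  tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
  flush_dcache_page(page);

  iov_iter_advance(ii, tmp);  /* moved before the !tmp retry */
  if (!tmp) {
      unlock_page(page);
      page_cache_release(page);
      goto again;
  }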
A similar problem is described in 124d3b7041f ("fix writev regression:
pan hanging unkillable and un-straceable"). If a zero-length segment
is followed by a segment with an invalid address,
iov_iter_fault_in_readable() checks only the first segment (zero-length),
iov_iter_copy_from_user_atomic() skips it, fails at the second one and
returns zero -> goto again without skipping the zero-length segment.
The patch calls iov_iter_advance() before the goto again: we'll skip the
zero-length segment at the second iteration and
iov_iter_fault_in_readable() will detect the invalid address.
Special thanks to Konstantin Khlebnikov, who helped a lot with the commit
description.
Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Maxim Patlasov <mpatlasov@parallels.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: ea9b9907b82a ("fuse: implement perform_write") Cc: <stable@vger.kernel.org>
(cherry picked from commit 3ca8138f014a913f98e6ef40e939868e1e9ea876)
Alexey Dobriyan [Thu, 25 Jun 2015 22:00:54 +0000 (15:00 -0700)]
proc: fix PAGE_SIZE limit of /proc/$PID/cmdline
/proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with
$ cat /proc/self/cmdline $(seq 1037) 2>/dev/null
However, command line size was never limited to PAGE_SIZE but to 128 KB
and relatively recently the limitation was removed altogether.
People noticed and asked questions:
http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit
The seq file interface is not OK, because it kmallocs the whole output and
open + read(, 1) + sleep will pin arbitrary amounts of kernel memory. To
avoid that, a limit must be imposed, which is incompatible with arbitrarily
sized command lines.
I apologize for the hairy code, but it is a direct consequence of the
command line layout in memory and of hacks to support things like "init [3]".
The loops are "unrolled" because otherwise it is either macros which hide
control flow or functions with 7-8 arguments and an equal line count.
There should be a real setproctitle(2) or something.
[akpm@linux-foundation.org: fix a billion min() warnings] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jan Stancek <jstancek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 23093364
Mainline commit c2c0bb44620dece7ec97e28167e32c343da22867 Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
The problem comes from the fact that many such jobs (for the same device)
are being scheduled simultaneously. While scsi_remove_device() takes
shost->scan_mutex and scsi_device_lookup() will fail for a device in
SDEV_DEL state, there is no protection against someone who did
scsi_device_lookup() before we actually entered __scsi_remove_device(). So
the whole scenario looks like this: two callers make simultaneous (or
preemption happens) calls to scsi_device_lookup(), these calls succeed
for both of them, and after that they both try scsi_remove_device().
shost->scan_mutex only serializes their calls to __scsi_remove_device(),
and we end up doing the cleanup path twice.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit be821fd8e62765de43cc4f0e2db363d0e30a7e9b)
Orabug: 23021563 Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>
David S. Miller [Mon, 14 Mar 2016 03:28:00 +0000 (23:28 -0400)]
ipv4: Don't do expensive useless work during inetdev destroy.
When an inetdev is destroyed, every address assigned to the interface
is removed. And in this scenario we do two pointless things which can
be very expensive if the number of assigned interfaces is large:
1) Address promotion. We are deleting all addresses, so there is no
point in doing this.
2) A full nf conntrack table purge for every address. We only need to
do this once, as is already caught by the existing
masq_dev_notifier so masq_inet_event() can skip this.
Reported-by: Solar Designer <solar@openwall.com> Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
(cherry picked from commit fbd40ea0180a2d328c5adc61414dc8bab9335ce2)
Filipe Manana [Sat, 20 Jun 2015 17:20:09 +0000 (18:20 +0100)]
Btrfs: fix shrinking truncate when the no_holes feature is enabled
If the no_holes feature is enabled, we attempt to shrink a file to a size
that ends up in the middle of a hole and we don't have any file extent
items in the fs/subvol tree that go beyond the new file size (or any
ordered extents that will insert such file extent items), we end up not
updating the inode's disk_i_size, we only update the inode's i_size.
This means that after unmounting and mounting the filesystem, or after
the inode is evicted and reloaded, its i_size ends up being incorrect
(an inode's i_size is set to the disk_i_size field when an inode is
loaded). This happens when btrfs_truncate_inode_items() doesn't find
any file extent items to drop - in this case it never makes a call to
btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.
Example reproducer:
$ mkfs.btrfs -O no-holes -f /dev/sdd
$ mount /dev/sdd /mnt
# Create our test file with some data and durably persist it.
$ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
$ sync
# Append some data to the file, increasing its size, and leave a hole
# between the old size and the start offset of the following write. So
# our file gets a hole in the range [128Kb, 256Kb[.
$ xfs_io -c "pwrite -S 0xbb 256K 32K" /mnt/foo
# Now truncate our file to a smaller size that is in the middle of the
# hole we previously created.
$ xfs_io -c "truncate 160K" /mnt/foo
# We expect to see our file with a size of 160Kb, with the first 128Kb
# of data all having the value 0xaa and the remaining 32Kb of data all
# having the value 0x00.
$ od -t x1 /mnt/foo
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0500000
# Now cleanly unmount and mount again the filesystem.
$ umount /mnt
$ mount /dev/sdd /mnt
# We expect to get the same result as before, a file with a size of
# 160Kb, with the first 128Kb of data all having the value 0xaa and the
# remaining 32Kb of data all having the value 0x00.
$ od -t x1 /mnt/foo
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0400000
In the example above the file size/data do not match what they were before
the remount.
Fix this by always calling btrfs_ordered_update_i_size() with a size
matching the size the file was truncated to if btrfs_truncate_inode_items()
is not called for a log tree and no file extent items were dropped. This
ensures the same behaviour as when the no_holes feature is not enabled.
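A sketch of that logic at the end of btrfs_truncate_inode_items()
(variable names approximate):

  /* last_size defaults to the new (truncated) size, so even when no
   * extent items were dropped disk_i_size gets synced; the log tree
   * is skipped since it has no disk_i_size to maintain. */
  if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
      btrfs_ordered_update_i_size(inode, last_size, NULL);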
Even though we delay the rename of directories when they become
descendants of other directories that were also renamed in the send
root, to prevent infinite path build loops, we were doing it in cases
where this was not needed and was actually harmful, resulting in
infinite path build loops as we ended up with a circular dependency
of delayed directory renames.
Consider the following reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
This reproducer resulted in an infinite path build loop when building the
path for inode 266 because the following circular dependency of delayed
directory renames was created:
Where the notation "X <- Y" means the rename of inode X is delayed by the
rename of inode Y (X will be renamed after Y is renamed). This resulted
in an infinite path build loop of inode 266 because that inode has inode
261 as an ancestor in the send root and inode 261 is in the circular
dependency of delayed renames listed above.
Fix this by not delaying the rename of a directory inode if an ancestor of
the inode in the send root, which has a delayed rename operation, is not
also a descendant of the inode in the parent root.
Thanks to Robbie Ko for sending the reproducer example.
A test case for xfstests follows soon.
Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from mainline commit 80aa6027561eef12b49031d46fd6724daf1e7fb6) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>