www.infradead.org Git - users/jedix/linux-maple.git/log

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks:
  mm/hugetlb: hugetlb_no_page: rate-limit warning message
  net/vxlan: Fix kernel unaligned access in __vxlan_find_mac
  kexec: align crash_notes allocation to make it be inside one physical page
  iommu-common: Fix error code used in iommu_tbl_range_{alloc,free}().

mm/hugetlb: hugetlb_no_page: rate-limit warning message

The warning message "killed due to inadequate hugepage pool" simply
indicates that SIGBUS was sent, not that the process was forcibly killed.
If the process has a signal handler installed does not fix the problem,
this message can rapidly spam the kernel log.

On my amd64 dev machine that does not have hugepages configured, I can
reproduce the repeated warnings easily by setting vm.nr_hugepages=2 (i.e.,
4 megabytes of huge pages) and running something that sets a signal
handler and forks, like

  #include <sys/mman.h>
  #include <signal.h>
  #include <stdlib.h>
  #include <unistd.h>

  sig_atomic_t counter = 10;
  void handler(int signal)
  {
      if (counter-- == 0)
         exit(0);
  }

  int main(void)
  {
      int status;
      char *addr = mmap(NULL, 4 * 1048576, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (addr == MAP_FAILED) {perror("mmap"); return 1;}
      *addr = 'x';
      switch (fork()) {
         case -1:
            perror("fork"); return 1;
         case 0:
            signal(SIGBUS, handler);
            *addr = 'x';
            break;
         default:
            *addr = 'x';
            wait(&status);
            if (WIFSIGNALED(status)) {
               psignal(WTERMSIG(status), "child");
            }
            break;
      }
  }

Signed-off-by: Geoffrey Thomas <geofft@ldpreload.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 25159252
mainline v4.5 commit 910154d520c97cd0095a889e6b878041c91111a6
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>

net/vxlan: Fix kernel unaligned access in __vxlan_find_mac

Orabug: 24593619

Backport of upstream commit 7177a3b037c7 ("net/vxlan: Fix kernel
unaligned access in __vxlan_find_mac")

__vxlan_find_mac invokes ether_addr_equal on the eth_addr field,
which triggers unaligned access messages, so rearrange vxlan_fdb
to avoid this in the most non-intrusive way.

Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit aa51c7b55e350240c2ed5a8a217688d5bfd13424)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

kexec: align crash_notes allocation to make it be inside one physical page

Port of mainline commit bbb78b8f3

Orabug: 25071982

Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit 45cbd27626d17d485ca0718b28add12513690921)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

iommu-common: Fix error code used in iommu_tbl_range_{alloc,free}().

The value returned from iommu_tbl_range_alloc() (and the one passed
in as a fourth argument to iommu_tbl_range_free) is not a DMA address,
it is rather an index into the IOMMU page table.

Therefore using DMA_ERROR_CODE is not appropriate.

Use a more type matching error code define, IOMMU_ERROR_CODE, and
update all users of this interface.

Orabug: 25159279

Reported-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Merge branch topic/uek-4.1/rpm-build of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/rpm-build:
uek-rpm nano: Remove vmware modules from UEK Nano
uek-rpm:enable ldmvsw module

Merge branch topic/uek-4.1/dtrace of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/dtrace:
  dtrace: eliminate need for arg counting in sdt macros
  dtrace: augment SDT probes with type information
  dtrace: import the sdt type information into per-sdt_probedesc state
  dtrace: record SDT and perf probe types in a new ELF section

Merge branch topic/uek-4.1/kernel-generic of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/kernel-generic:
x86/hpet: Reduce HPET counter read contention

Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/ofed:
  mlx4: avoid multiple free on id_map_ent
  xsigo: supported SGE's for LSO QP
  xsigo: Hardening driver in handling remote QP failures
  IB/cm: avoid query device in CM REQ/REP
  IB/cm: return original rnr value when RNR WA for PSIF
  IB/cm: MBIT needs to be used in network order
  IB/core: Issue DREQ when receiving REQ/REP for stale QP
  sif: cq: cleanup cqe once a kernel qp is destroyed/reset
  sif: cq: sif_poll_cq might not drain cq completely
  sif: rq: do not flush rq if it is an srq
  sif: cq: use refcnt to disable/enable cq polling
  sif: pt: Add support for single thread modified page tables
  sif: pqp: Implement handling of PQPs in error.
  sif: eps*: initialize each struct in array
  sif: query_device: Return correct #SGEs for EoIB
  sif: LSO not supported for EoIB queuepairs
  sif: pqp: Make setup/teardown function ref sif_pqp_info directly
  sif: Move the rest of the pqp setup and teardown to sif_pqp
  sif: Move sif_dfs_register beyond base init
  sif: Refactor PQP state out of sif_dev.

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/drivers:
  NVMe: reduce queue depth as workaround for Samsung EPIC SQ errata
  bonding: "primary_reselect" with "failure" is not working properly
  IB/core: uverbs: Do not wait for uverbs dev closure during forced system shutdown.

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks:
  Bluetooth: Fix potential NULL dereference in RFCOMM bind callback
  aacraid: Check size values after double-fetch from user
  mm: migrate dirty page without clear_page_dirty_for_io etc
  xen-netfront: cast grant table reference first to type int
  xen-netfront: do not cast grant table reference to signed short

Merge branch topic/uek-4.1/stable-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/stable-cherry-picks: (21 commits)
  ocfs2: fix not enough credit panic
  ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
  ocfs2/dlm: fix race between convert and migration
  ocfs2: solve a problem of crossing the boundary in updating backups
  ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
  ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
  ocfs2/dlm: do not insert a new mle when another process is already migrating
  ocfs2: fix slot overwritten if storage link down during mount
  ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
  ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
  ocfs2/dlm: fix a race between purge and migration
  ocfs2/dlm: clear migration_pending when migration target goes down
  ocfs2: fix BUG when calculate new backup super
  ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error
  ocfs2: fix race between mount and delete node/cluster
  ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
  ocfs2: avoid access invalid address when read o2dlm debug messages
  ocfs2: fix a tiny case that inode can not removed
  ocfs2: trusted xattr missing CAP_SYS_ADMIN check
  ocfs2: set filesytem read-only when ocfs2_delete_entry failed.
  ...

mlx4: avoid multiple free on id_map_ent

orabug: 25022868
quickref: 24486198 (uek2) 25022856 (uek3)

Double free is found on id_map_ent.
Existing code makes another queue_delayed_work regardless if the worker was
successfully canceled or not. In case it's not, means the worker routine already
started to run, and then the later queued work will do the same work against
the same id_map_ent stucture. The worker routine actually does cleanup/free
on the structure, so the 2nd run of it is dengerous.

Fix is that we check if we successfully canceled previously queued work, if
that's not canceled, we don't queue the work again on the same structure.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>

Bluetooth: Fix potential NULL dereference in RFCOMM bind callback

Orabug: 25058887
CVE: CVE-2015-8956

addr can be NULL and it should not be dereferenced before NULL checking.

Signed-off-by: Jaganath Kanakkassery <jaganath.k@samsung.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
(cherry picked from commit 951b6a0717db97ce420547222647bcc40bf1eacd)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ocfs2: fix not enough credit panic

The following panic was caught when run ocfs2 disconfig single test
(block size 512 and cluster size 8192). ocfs2_journal_dirty() return
-ENOSPC, that means credits were used up. The total credit should
include 3 times of "num_dx_leaves" from ocfs2_dx_dir_rebalance(),
because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and
1 time will be consumed in ocfs2_dx_dir_new_cluster()->
__ocfs2_dx_dir_new_cluster()->ocfs2_dx_dir_format_cluster(). But only
two times is included in ocfs2_dx_dir_rebalance_credits(), fix it.

[34377.331151] ------------[ cut here ]------------
[34377.332007] kernel BUG at fs/ocfs2/journal.c:775!
[34377.344107] invalid opcode: 0000 [#1] SMP
[34377.346090] Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
[34377.346090] CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 #2
[34377.346090] Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
[34377.346090] task: ffff8800b6de6200 ti: ffff8800a7d48000 task.ti: ffff8800a7d48000
[34377.346090] RIP: 0010:[<ffffffffa06e7397>]  [<ffffffffa06e7397>] ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
[34377.346090] RSP: 0018:ffff8800a7d4b6d8  EFLAGS: 00010286
[34377.346090] RAX: 00000000ffffffe4 RBX: 00000000814d0a9c RCX: 00000000000004f9
[34377.346090] RDX: ffffffffa008e990 RSI: ffffffffa008f1ee RDI: ffff8800622b6460
[34377.346090] RBP: ffff8800a7d4b6f8 R08: ffffffffa008f288 R09: ffff8800622b6460
[34377.346090] R10: 0000000000000000 R11: 0000000000000282 R12: 0000000002c8421e
[34377.346090] R13: ffff88006d0cad00 R14: ffff880092beef60 R15: 0000000000000070
[34377.346090] FS:  00007f9b83e92700(0000) GS:ffff8800be880000(0000) knlGS:0000000000000000
[34377.346090] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[34377.346090] CR2: 00007fb2c0d1a000 CR3: 0000000008f80000 CR4: 00000000000406e0
[34377.346090] Stack:
[34377.346090]  00000000814d0a9c ffff88005fe61e00 ffff88006e995c00 ffff880009847c00
[34377.346090]  ffff8800a7d4b768 ffffffffa06c0999 ffff88009d3c2a10 ffff88005fe61e30
[34377.346090]  ffff8800997ce500 ffff8800997ce980 ffff880092beef60 000000100000000e
[34377.346090] Call Trace:
[34377.346090]  [<ffffffffa06c0999>] ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2]
[34377.346090]  [<ffffffffa06c3eeb>] ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2]
[34377.346090]  [<ffffffffa06dedb2>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
[34377.346090]  [<ffffffffa0761180>] ? ocfs2_refcount_tree_et_ops+0x60/0xfffffffffffe4b31 [ocfs2]
[34377.346090]  [<ffffffffa06e7730>] ? ocfs2_journal_access_dl+0x20/0x20 [ocfs2]
[34377.346090]  [<ffffffffa06c6b63>] ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2]
[34377.346090]  [<ffffffffa06c9709>] ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2]
[34377.346090]  [<ffffffffa06c9b16>] ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2]
[34377.346090]  [<ffffffffa06dee90>] ? ocfs2_read_inode_block+0x10/0x20 [ocfs2]
[34377.346090]  [<ffffffffa06f38e2>] ocfs2_mknod+0x5a2/0x1400 [ocfs2]
[34377.346090]  [<ffffffffa06f4933>] ocfs2_create+0x73/0x180 [ocfs2]
[34377.346090]  [<ffffffff81211de8>] vfs_create+0xd8/0x100
[34377.346090]  [<ffffffff8120f5fd>] ? lookup_real+0x1d/0x60
[34377.346090]  [<ffffffff81212535>] lookup_open+0x185/0x1c0
[34377.346090]  [<ffffffff8121571d>] do_last+0x36d/0x780
[34377.346090]  [<ffffffff812a85d6>] ? security_file_alloc+0x16/0x20
[34377.346090]  [<ffffffff81215bc2>] path_openat+0x92/0x470
[34377.346090]  [<ffffffff81215fea>] do_filp_open+0x4a/0xa0
[34377.346090]  [<ffffffff8132c570>] ? find_next_zero_bit+0x10/0x20
[34377.346090]  [<ffffffff812232ec>] ? __alloc_fd+0xac/0x150
[34377.346090]  [<ffffffff8120459a>] do_sys_open+0x11a/0x230
[34377.346090]  [<ffffffff810259d3>] ? syscall_trace_enter_phase1+0x153/0x180
[34377.346090]  [<ffffffff812046ee>] SyS_open+0x1e/0x20
[34377.346090]  [<ffffffff816cb6ee>] system_call_fastpath+0x12/0x71
[34377.346090] Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
[34377.346090] RIP  [<ffffffffa06e7397>] ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
[34377.346090]  RSP <ffff8800a7d4b6d8>
[34377.615401] ---[ end trace 91ac5312a6ee1288 ]---
[34377.618919] Kernel panic - not syncing: Fatal exception
[34377.619910] Kernel Offset: disabled

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()

The testcase "mmaptruncate" of ocfs2-test deadlocks occasionally.

In this testcase, we create a 2*CLUSTER_SIZE file and mmap() on it;
there are 2 process repeatedly performing the following operations
respectively: one is doing memset(mmaped_addr + 2*CLUSTER_SIZE - 1, 'a',
1), while the another is playing ftruncate(fd, 2*CLUSTER_SIZE) and then
ftruncate(fd, CLUSTER_SIZE) again and again.

This is the backtrace when the deadlock happens:

   __wait_on_bit_lock+0x50/0xa0
   __lock_page+0xb7/0xc0
   ocfs2_write_begin_nolock+0x163f/0x1790 [ocfs2]
   ocfs2_page_mkwrite+0x1c7/0x2a0 [ocfs2]
   do_page_mkwrite+0x66/0xc0
   handle_mm_fault+0x685/0x1350
   __do_page_fault+0x1d8/0x4d0
   trace_do_page_fault+0x37/0xf0
   do_async_page_fault+0x19/0x70
   async_page_fault+0x28/0x30

In ocfs2_write_begin_nolock(), we first grab the pages and then allocate
disk space for this write; ocfs2_try_to_free_truncate_log() will be
called if -ENOSPC is returned; if we're lucky to get enough clusters,
which is usually the case, we start over again.

But in ocfs2_free_write_ctxt() the target page isn't unlocked, so we
will deadlock when trying to grab the target page again.

Also, -ENOMEM might be returned in ocfs2_grab_pages_for_write().
Another deadlock will happen in __do_page_mkwrite() if
ocfs2_page_mkwrite() returns non-VM_FAULT_LOCKED, and along with a
locked target page.

These two errors fail on the same path, so fix them by unlocking the
target page manually before ocfs2_free_write_ctxt().

Jan Kara helps me clear out the JBD2 part, and suggest the hint for root
cause.

Changes since v1:
1. Also put ENOMEM error case into consideration.

Link: http://lkml.kernel.org/r/1474173902-32075-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: He Gang <ghe@suse.com>
Acked-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c33f0785bf292cf1d15f4fbe42869c63e205b21c)

Conflicts:

fs/ocfs2/aops.c

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2/dlm: fix race between convert and migration

Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
checks if lockres master has changed to identify whether new master has
finished recovery or not. This will introduce a race that right after
old master does umount ( means master will change), a new convert
request comes.

In this case, it will reset lockres state to DLM_RECOVERING and then
retry convert, and then fail with lockres->l_action being set to
OCFS2_AST_INVALID, which will cause inconsistent lock level between
ocfs2 and dlm, and then finally BUG.

Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify
the race case between convert and recovery. So fix it.

Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e6f0c6e6170fec175fe676495f29029aecdf486c)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: solve a problem of crossing the boundary in updating backups

In update_backups() there exists a problem of crossing the boundary as
follows:

we assume that lun will be resized to 1TB(cluster_size is 32kb), it will
include 0~33554431 cluster, in update_backups func, it will backup super
block in location of 1TB which is the 33554432th cluster, so the
phenomenon of crossing the boundary happens.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Xue jiufei <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 584dca3440732afa84fbca07567bb66e1453936a)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()

Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so it still exists a
deadlock scenario.

    ocfs2_wake_downconvert_thread
    ocfs2_rw_unlock
    ocfs2_dio_end_io
    dio_complete
    .....
    bio_endio
    req_bio_endio
    ....
    scsi_io_completion
    blk_done_softirq
    __do_softirq
    do_softirq
    irq_exit
    do_IRQ
    ocfs2_osb_dump
    cat /sys/kernel/debug/ocfs2/${uuid}/fs_state

This patch still uses spin_lock_irqsave() - replace spin_lock() to solve
this situation.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit bfd97a0320d338b2fce422adeddd512466ef2390)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del

In ocfs2_orphan_del, currently it finds and deletes entry first, and
then access orphan dir dinode.  This will have a problem once
ocfs2_journal_access_di fails.  In this case, entry will be removed from
orphan dir, but in deed the inode hasn't been deleted successfully.  In
other words, the file is missing but not actually deleted.  So we should
access orphan dinode first like unlink and rename.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 074a6c655f6da12cb1123c8a84bfd8d781138800)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2/dlm: do not insert a new mle when another process is already migrating

When two processes are migrating the same lockres,
dlm_add_migration_mle() return -EEXIST, but insert a new mle in hash
list. dlm_migrate_lockres() will detach the old mle and free the new
one which is already in hash list, that will destroy the list.

Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32e493265b2be96404aaa478fb2913be29b06887)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: fix slot overwritten if storage link down during mount

The following case will lead to slot overwritten.

N1                               N2
mount ocfs2 volume, find and
allocate slot 0, then set
osb->slot_num to 0, begin to
write slot info to disk
                                 mount ocfs2 volume, wait for super lock
write block fail because of
storage link down, unlock
super lock
                                 got super lock and also allocate slot 0
                                 then unlock super lock

mount fail and then dismount,
since osb->slot_num is 0, try to
put invalid slot to disk. And it
will succeed if storage link
restores.
                                 N2 slot info is now overwritten

Once another node say N3 mount, it will find and allocate slot 0 again,
which will lead to mount hung because journal has already been locked by
N2.  so when write slot info failed, invalidate slot in advance to avoid
overwrite slot.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1247017f43a93eae3d64b7c25f3637dc545f5a47)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2/dlm: return appropriate value when dlm_grab() returns NULL

dlm_grab() may return NULL when the node is doing unmount.  When doing
code review, we found that some dlm handlers may return error to caller
when dlm_grab() returns NULL and make caller BUG or other problems.
Here is an example:

Node 1                                 Node 2
receives migration message
from node 3, and send
migrate request to others
                                     start unmounting

                                     receives migrate request
                                     from node 1 and call
                                     dlm_migrate_request_handler()

                                     unmount thread unregisters
                                     domain handlers and removes
                                     dlm_context from dlm_domains

                                     dlm_migrate_request_handlers()
                                     returns -EINVAL to node 1
Exit migration neither clearing the
migration state nor sending
assert master message to node 3 which
cause node 3 hung.

Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c372f2193a2e73d5936bf37259ae63ca388b4cbc)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker

Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear
refmap bit on lockres") still exists a race which can't ensure the
ordering is exactly correct.

Node1               Node2                    Node3
umount, migrate
lockres to Node2
                    migrate finished,
                    send migrate request
                    to Node3
                                              received migrate request,
                                              create a migration_mle,
                                              respond to Node2.
                    set DLM_LOCK_RES_SETREF_INPROG
                    and send assert master to
                    Node3
                                              delete migration_mle in
                                              assert_master_handler,
                                              Node3 umount without response
                                              dlm_thread purge
                                              this lockres, send drop
                                              deref message to Node2
                    found the flag of
                    DLM_LOCK_RES_SETREF_INPROG
                    is set, dispatch
                    dlm_deref_lockres_worker to
                    clear refmap, but in function of
                    dlm_deref_lockres_worker,
                    only if node in refmap it wait
                    DLM_LOCK_RES_SETREF_INPROG
                    to be cleared. So worker is
                    done successfully

                                              purge lockres, send
                                              assert master response
                                              to Node1, and finish umount
                    set Node3 in refmap, and it
                    won't be cleared forever, thus
                    lead to umount hung

so wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
dlm_deref_lockres_worker.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b5560143385e18b4109ad6951c7719705e3dd995)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2/dlm: fix a race between purge and migration

We found a race between purge and migration when doing code review.
Node A put lockres to purgelist before receiving the migrate message
from node B which is the master.  Node A call dlm_mig_lockres_handler to
handle this message.

dlm_mig_lockres_handler
  dlm_lookup_lockres
  >>>>>> race window, dlm_run_purge_list may run and send
         deref message to master, waiting the response
  spin_lock(&res->spinlock);
  res->state |= DLM_LOCK_RES_MIGRATING;
  spin_unlock(&res->spinlock);
  dlm_mig_lockres_handler returns

  >>>>>> dlm_thread receives the response from master for the deref
  message and triggers the BUG because the lockres has the state
  DLM_LOCK_RES_MIGRATING with the following message:

dlm_purge_lockres:209 ERROR: 6633EB681FA7474A9C280A4E1A836F0F: res
M0000000000000000030c0300000000 in use after deref

Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 30bee898f86506893883ffb8db20d8101a29b5f5)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2/dlm: clear migration_pending when migration target goes down

We have found a BUG on res->migration_pending when migrating lock
resources.  The situation is as follows.

dlm_mark_lockres_migration
  res->migration_pending = 1;
  __dlm_lockres_reserve_ast
  dlm_lockres_release_ast returns with res->migration_pending remains
      because other threads reserve asts
  wait dlm_migration_can_proceed returns 1
  >>>>>>> o2hb found that target goes down and remove target
          from domain_map
  dlm_migration_can_proceed returns 1
  dlm_mark_lockres_migrating returns -ESHOTDOWN with
      res->migration_pending still remains.

When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
with res->migration_pending.  So clear migration_pending when target is
down.

Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cc28d6d80f6ab494b10f0e2ec949eacd610f66e3)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: fix BUG when calculate new backup super

When resizing, it firstly extends the last gd.  Once it should backup
super in the gd, it calculates new backup super and update the
corresponding value.

But it currently doesn't consider the situation that the backup super is
already done.  And in this case, it still sets the bit in gd bitmap and
then decrease from bg_free_bits_count, which leads to a corrupted gd and
trigger the BUG in ocfs2_block_group_set_bits:

    BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits);

So check whether the backup super is done and then do the updates.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Jiufei Xue <xuejiufei@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5c9ee4cbf2a945271f25b89b137f2c03bbc3be33)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error

In ocfs2_mknod_locked if '__ocfs2_mknod_locke d' returns an error, we
should reclaim the inode successfully claimed above, otherwise, the
inode never be reused. The case is described below:

ocfs2_mknod
    ocfs2_mknod_locked
        ocfs2_claim_new_inode
                Successfully claim the inode
        __ocfs2_mknod_locked
            ocfs2_journal_access_di
            Failed because of -ENOMEM or other reasons, the inode
                        lockres has not been initialized yet.

    iput(inode)
        ocfs2_evict_inode
            ocfs2_delete_inode
                ocfs2_inode_lock
                    ocfs2_inode_lock_full_nested
                        __ocfs2_cluster_lock
                                Return -EINVAL because of the inode
                                lockres has not been initialized.

                So the following operations are not performed
                ocfs2_wipe_inode
                        ocfs2_remove_inode
                                ocfs2_free_dinode
                                        ocfs2_free_suballoc_bits

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b1529a41f777a48f95d4af29668b70ffe3360e1b)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: fix race between mount and delete node/cluster

There is a race case between mount and delete node/cluster, which will
lead o2hb_thread to malfunctioning dead loop.

    o2hb_thread
    {
        o2nm_depend_this_node();
        <<<<<< race window, node may have already been deleted, and then
               enter the loop, o2hb thread will be malfunctioning
               because of no configured nodes found.
        while (!kthread_should_stop() &&
               !reg->hr_unclean_stop && !reg->hr_aborted_start) {
    }

So check the return value of o2nm_depend_this_node() is needed.  If node
has been deleted, do not enter the loop and let mount fail.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0986fe9b50f425ec81f25a1a85aaf3574b31d801)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put

dlm_lockres_put will call dlm_lockres_release if it is the last
reference, and then it may call dlm_print_one_lock_resource and
take lockres spinlock.

So unlock lockres spinlock before dlm_lockres_put to avoid deadlock.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b67de018b37a97548645a879c627d4188518e907)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: avoid access invalid address when read o2dlm debug messages

The following case will lead to a lockres is freed but is still in use.

cat /sys/kernel/debug/o2dlm/locking_state dlm_thread
lockres_seq_start
    -> lock dlm->track_lock
    -> get resA
                                                resA->refs decrease to 0,
                                                call dlm_lockres_release,
                                                and wait for "cat" unlock.
Although resA->refs is already set to 0,
increase resA->refs, and then unlock
                                                lock dlm->track_lock
                                                    -> list_del_init()
                                                    -> unlock
                                                    -> free resA

In such a race case, invalid address access may occurs.  So we should
delete list res->tracking before resA->refs decrease to 0.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f57a22ddecd6f26040a67e2c12880f98f88b6e00)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: fix a tiny case that inode can not removed

When running dirop_fileop_racer we found a case that inode
can not removed.

Two nodes, say Node A and Node B, mount the same ocfs2 volume.  Create
two dirs /race/1/ and /race/2/ in the filesystem.

  Node A                            Node B
  rm -r /race/2/
                                    mv /race/1/ /race/2/
  call ocfs2_unlink(), get
  the EX mode of /race/2/
                                    wait for B unlock /race/2/
  decrease i_nlink of /race/2/ to 0,
  and add inode of /race/2/ into
  orphan dir, unlock /race/2/
                                    got EX mode of /race/2/. because
                                    /race/1/ is dir, so inc i_nlink
                                    of /race/2/ and update into disk,
                                    unlock /race/2/
  because i_nlink of /race/2/
  is not zero, this inode will
  always remain in orphan dir

This patch fixes this case by test whether i_nlink of new dir is zero.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Xue jiufei <xuejiufei@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 928dda1f9433f024ac48c3d97ae683bf83dd0e42)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: trusted xattr missing CAP_SYS_ADMIN check

The trusted extended attributes are only visible to the process which
hvae CAP_SYS_ADMIN capability but the check is missing in ocfs2
xattr_handler trusted list. The check is important because this will be
used for implementing mechanisms in the userspace for which other
ordinary processes should not have access to.

Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Taesoo kim <taesoo@gatech.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0f5e7b41f91814447defc34e915fc5d6e52266d9)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: set filesytem read-only when ocfs2_delete_entry failed.

In ocfs2_rename, it will lead to an inode with two entried(old and new) if
ocfs2_delete_entry(old) failed.  Thus, filesystem will be inconsistent.

The case is described below:

ocfs2_rename
    -> ocfs2_start_trans
    -> ocfs2_add_entry(new)
    -> ocfs2_delete_entry(old)
        -> __ocfs2_journal_access *failed* because of -ENOMEM
    -> ocfs2_commit_trans

So filesystem should be set to read-only at the moment.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 807a7907114c7c703017ed7a96477a2eeb0d08e0)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

ocfs2: fix NULL pointer dereference in function ocfs2_abort_trigger()

ocfs2_abort_trigger() use bh->b_assoc_map to get sb. But there's no
function to set bh->b_assoc_map in ocfs2, it will trigger NULL pointer
dereference while calling this function. We can get sb from
bh->b_bdev->bd_super instead of b_assoc_map.

[akpm@linux-foundation.org: update comment, per Joseph]
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 74e364ad1b13fd518a0bd4e5aec56d5e8706152f)

Orabug: 24939243

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>

xsigo: supported SGE's for LSO QP

Orabug: 25029868

PSIF supports 15 SGE's in case LSO QP linearize skb
if number of frags is greater than allowed sge's

Reported-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: viswa krishnamurthy <viswa.krishnamurthy@oracle.com>
Reviewed-by: UmaShankar Tumari Mahabalagiri <umashankar.mahabalagiri@oracle.com>

xsigo: Hardening driver in handling remote QP failures

Orabug: 24929076

Handle scenario's where PSIF generates batched
transmit completions.

In case of Remote QP disconnect follow this sequence
-> If there are any pending transmit completion
explicitly transition QP state to ERROR.
-> Wait for a maximum of 10 seconds for all pending
completions
(10 seconds derived from retrycount * local ack timeout)
Destroy QP
Wait for another 10 seconds(max) if completions are
not returned by hardware.

Synchronized calls to poll_tx

Added more efficiency in handling of TX queue full condtion.

Handle scenario where uVNIC removal can come in batches

Reported-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: Aravind Kini <aravind.kini@oracle.com>
Reviewed-by: viswa krishnamurthy <viswa.krishnamurthy@oracle.com>
Reviewed-by: Manish Kumar Singh <mk.singh@oracle.com>
Reviewed-by: Ariel cohen <ariel.x.cohen@oracle.com>
Reviewed-by: UmaShankar Tumari Mahabalagiri <umashankar.mahabalagiri@oracle.com>

aacraid: Check size values after double-fetch from user

Orabug: 25060030

In aacraid's ioctl_send_fib() we do two fetches from userspace, one the
get the fib header's size and one for the fib itself. Later we use the
size field from the second fetch to further process the fib. If for some
reason the size from the second fetch is different than from the first
fix, we may encounter an out-of- bounds access in aac_fib_send(). We
also check the sender size to insure it is not out of bounds. This was
reported in https://bugzilla.kernel.org/show_bug.cgi?id=116751 and was
assigned CVE-2016-6480.

Reported-by: Pengfei Wang <wpengfeinudt@gmail.com>
Fixes: 7c00ffa31 '[SCSI] 2.6 aacraid: Variable FIB size (updated patch)'
Cc: stable@vger.kernel.org
Signed-off-by: Dave Carroll <david.carroll@microsemi.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit fa00c437eef8dc2e7b25f8cd868cfa405fcc2bb3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

NVMe: reduce queue depth as workaround for Samsung EPIC SQ errata

Orabug: 25138123

Oracle discovered that the NVMe driver gets SQ completion errors eventually
leading to the device being reset, taken out of the PCI bus tree or kernel
panics when using the default SQ size of 1024 entries (64KB) for Samsung
EPIC NVMe SSDs.

PCIe analyzer tracing by Oracle and Samsung revealed an errata in Samsung's
firmware for EPIC SSDs where these invalid completion entries can occur
when the queues straddle an 8MB DMA address boundary.

This patch works around the errata by detecting these specific devices and
limiting their descriptor queue depth to 64. This is only for the Samsung
NVMe controllers used in Oracle X-series servers.

There was no noticeable performance impact of reducing queue depths to 64
for these Samsung drives, Oracle X6-2 server, and Oracle VM Server 3.4.2.

Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com>
Signed-off-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

x86/hpet: Reduce HPET counter read contention

On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
1) There is a single HPET counter shared by all the CPUs.
2) HPET counter reading is a very slow operation.

Using HPET as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance. Something
the performance slowdown can be so severe that the system may crash
because of a NMI watchdog soft lockup, for example.

During the TSC clock calibration process, the default clock source
will be set temporarily to HPET. For systems with many CPUs, it is
possible that NMI watchdog soft lockup may occur occasionally during
that short time period where HPET clocking is active as is shown in
the kernel log below:

[   71.646504] hpet0: 8 comparators, 64-bit 14.318180 MHz counter
[   71.655313] Switching to clocksource hpet
[   95.679135] BUG: soft lockup - CPU#144 stuck for 23s! [swapper/144:0]
[   95.693363] BUG: soft lockup - CPU#145 stuck for 23s! [swapper/145:0]
[   95.695580] BUG: soft lockup - CPU#582 stuck for 23s! [swapper/582:0]
[   95.698128] BUG: soft lockup - CPU#357 stuck for 23s! [swapper/357:0]

This patch addresses the above issues by reducing HPET read contention
using the fact that if more than one CPUs are trying to access HPET at
the same time, it will be more efficient when only one CPU in the group
reads the HPET counter and shares it with the rest of the group instead
of each group member trying to read the HPET counter individually.

This is done by using a combination quadword that contains a 32-bit
stored HPET value and a 32-bit spinlock.  The CPU that gets the lock
will be responsible for reading the HPET counter and storing it in
the quadword. The others will monitor the change in HPET value and
lock status and grab the latest stored HPET value accordingly. This
change is only enabled on 64-bit SMP configuration.

On a 4-socket Haswell-EX box with 144 threads (HT on), running the
AIM7 compute workload (1500 users) on a 4.8-rc1 kernel (HZ=1000)
with and without the patch has the following performance numbers
(with HPET or TSC as clock source):

TSC = 1042431 jobs/min
HPET w/o patch =  798068 jobs/min
HPET with patch = 1029445 jobs/min

The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 11.19% without patch to 1.24% with patch.

[ tglx: It's really sad that we need to have such hacks just to deal with
   the fact that cpu vendors have not managed to fix the TSC wreckage
   within 15+ years. Were They Forgetting? ]

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Tested-by: Prarit Bhargava <prarit@redhat.com>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Randy Wright <rwright@hpe.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1473182530-29175-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit f99fd22e4d4bc84880a8a3117311bbf0e3a6a9dc)

Orbug: 24744785

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>

bonding: "primary_reselect" with "failure" is not working properly

Upstream commit b5a983f3141239b5.

When "primary_reselect" is set to "failure", primary interface should
not become active until current active slave is down. But if we set first
member of bond device as a "primary" interface and "primary_reselect"
is set to "failure" then whenever primary interface's link gets back(up)
it becomes active slave even if current active slave is still up.

With this patch, "bond_find_best_slave" will not traverse members if
primary interface is not candidate for failover/reselection and current
active slave is still up.

Orabug: 24831739

Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Mazhar Rana <mazhar.rana@cyberoam.com>
Signed-off-by: Jay Vosburgh <j.vosburgh@gmail.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

uek-rpm nano: Remove vmware modules from UEK Nano

Orabug:24808621
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>

mm: migrate dirty page without clear_page_dirty_for_io etc

Orabug: 25059177
CVE: CVE-2016-3070

clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
since v2.6.16 first introduced page migration; and the set_page_dirty()
which completed its migration of PageDirty, later had to be moderated to
__set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.

No actual problems seen with this procedure recently, but if you look into
what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
achieving, it turns out to be nothing more than moving the PageDirty flag,
and its NR_FILE_DIRTY stat from one zone to another.

It would be good to avoid a pile of irrelevant decrementations and
incrementations, and improper event counting, and unnecessary descent of
the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
radix_tree_replace_slot() left in place anyway).

Do the NR_FILE_DIRTY movement, like the other stats movements, while
interrupts still disabled in migrate_page_move_mapping(); and don't even
bother if the zone is the same. Do the PageDirty movement there under
tree_lock too, where old page is frozen and newpage not yet visible:
bearing in mind that as soon as newpage becomes visible in radix_tree, an
un-page-locked set_page_dirty() might interfere (or perhaps that's just
not possible: anything doing so should already hold an additional
reference to the old page, preventing its migration; but play safe).

But we do still need to transfer PageDirty in migrate_page_copy(), for
those who don't go the mapping route through migrate_page_move_mapping().

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 42cb14b110a5698ccf26ce59c4441722605a3743)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/migrate.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-netfront: cast grant table reference first to type int

IS_ERR_VALUE() in commit 87557efc27f6a50140fb20df06a917f368ce3c66
("xen-netfront: do not cast grant table reference to signed short") would
not return true for error code unless we cast ref first to type int.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oracle-Bug: 25138361
upstream commit: 269ebce4531b8edc4224259a02143181a1c1d77c
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed by: Jack F. Vogel <jack.vogel@oracle.com>
Acked-by: Joe Jin <joe.jin@oracle.com>

xen-netfront: do not cast grant table reference to signed short

While grant reference is of type uint32_t, xen-netfront erroneously casts
it to signed short in BUG_ON().

This would lead to the xen domU panic during boot-up or migration when it
is attached with lots of paravirtual devices.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Oracle-Bug: 25138362
upstream commit: 87557efc27f6a50140fb20df06a917f368ce3c66
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed by: Jack F. Vogel <jack.vogel@oracle.com>
Acked-by: Joe Jin <joe.jin@oracle.com>

IB/core: uverbs: Do not wait for uverbs dev closure during forced system shutdown.

Waiting for uverbs dev file closure in the system forced shutdown [e.g
reboot -f] path locks up the reboot process forever preventing system
shutdown as application closure of this device file is very unlikely at
this time. This closure happens in orderly shutdown [e.g reboot] as
processes are sent SIGTERM.

Orabug: 24918039

Signed-off-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>

uek-rpm:enable ldmvsw module

Orabug: 23215917

Signed-off-by: Allen Pais <allen.pais@oracle.com>

Merge branch topic/uek-4.1/secureboot of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/secureboot:
acpi: Disable ACPI table override if securelevel is set

acpi: Disable ACPI table override if securelevel is set

From the kernel documentation (initrd_table_override.txt):

  If the ACPI_INITRD_TABLE_OVERRIDE compile option is true, it is possible
  to override nearly any ACPI table provided by the BIOS with an
  instrumented, modified one.

When securelevel is set, the kernel should disallow any unauthenticated
changes to kernel space. ACPI tables contain code invoked by the kernel, so
do not allow ACPI tables to be overridden if securelevel is set.

Signed-off-by: Linn Crosetto <linn@hpe.com>
Orabug: 25058372
CVE: CVE-2016-3699
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Reviewed-by: Guru Anbalagane <guru.anbalagane@oracle.com>

IB/cm: avoid query device in CM REQ/REP

The query device needed in CM REQ/REP is a bit expensive since
it involves a MAD query and also it is not saved in the cache.
When the driver that holds the local device is different from
sif then there is no need to go through the query device. If
sif driver is identified, then we still need to go through the
query device in order to get the specific vendor id. This last
is to make sure the software workaround is applied only to the
PSIF revisions that are affected.

This patch filters those cases and avoids unnecessary MAD queries
when driver is different from sif.

Orabug: 24785622

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Francisco Triviño <francisco.trivino@oracle.com>

IB/cm: return original rnr value when RNR WA for PSIF

With this patch, the original min_rnr_value set by the user is saved
in case it is later queried. The ib_qp flag has been re-purposed to
store the value in addition. This patch makes the RNR WA implementation
total transparent for the user.

Orabug: 24785622

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Francisco Triviño <francisco.trivino@oracle.com>

IB/cm: MBIT needs to be used in network order

This patch fixes the MBIT big endian portability.

Orabug: 24785622

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Francisco Triviño <francisco.trivino@oracle.com>

IB/core: Issue DREQ when receiving REQ/REP for stale QP

from "InfiBand Architecture Specifications Volume 1":

  A QP is said to have a stale connection when only one side has
  connection information. A stale connection may result if the remote CM
  had dropped the connection and sent a DREQ but the DREQ was never
  received by the local CM. Alternatively the remote CM may have lost
  all record of past connections because its node crashed and rebooted,
  while the local CM did not become aware of the remote node's reboot
  and therefore did not clean up stale connections.

and:

   A local CM may receive a REQ/REP for a stale connection. It shall
   abort the connection issuing REJ to the REQ/REP. It shall then issue
   DREQ with "DREQ:remote QPNâ\80\9d set to the remote QPN from the REQ/REP.

This patch solves a problem with reuse of QPN. Current codebase, that
is IPoIB, relies on a REAP-mechanism to do cleanup of the structures
in CM. A problem with this is the timeconstants governing this
mechanism; they are up to 768 seconds and the interface may look
inresponsive in that period.  Issuing a DREQ (and receiving a DREP)
does the necessary cleanup and the interface comes up.

Orabug: 24806985

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

sif: cq: cleanup cqe once a kernel qp is destroyed/reset

To ease the sqflush/rqflush workaround, sifdrv removes
all the associated cqes once a kernel qp is destroy/reset.

Even though IB specification 10.2.4.4 mentioned that
"Destroying a QP does not guarantee that CQEs of that
QP are deallocated from the CQ upon destruction.",
it also stated that "Even if the CQEs are already on
the CQ, it might not be possible to retrieve them"

Thus, IB spec is indicating that it is vendor specific
implementation and the ULP should not assume that the
cqes are in the cq once a qp is destroyed/reset.

Orabug: 25070316

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: cq: sif_poll_cq might not drain cq completely

In a scenario where a duplicate completion is detected, the sif_poll_cq
skips the remaining cqes in the cq. This code bug is introduced in
commit "sif: sqflush: Handle duplicate completions in poll_cq".

Orabug: 25038711

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: rq: do not flush rq if it is an srq

sifdrv needs to flush a regular rq (non-srq) once it
detects that a qp is transitioned into ERR state.
Nevertheless, if a qp is created with srq and
with no event handler, the srq might be accidentally
flushed once the qp is transitioned into ERR sate.

Orabug: 25071205

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

sif: cq: use refcnt to disable/enable cq polling

Due to a hardware bug, sifdrv must clean up the cq
before it is being polled by the user. Thus, sifdrv
uses CQ_POLLING_NOT_ALLOWED bitmask to disable/enable
cq polling.

Nevertheless, the bit mask operation is not sufficient in
a shared cq scenario (many qps to one cq). The cq clean up
is performed by each qp and it might be performed concurrently.
As a result, the cq polling might be enabled before all qps
have clean up the cq.

Thus, this patch uses refcnt to disable/enable cq polling.
The CQ_POLLING_NOT_ALLOWED bitmask is kept in order to have
backward compatibility in the user library.

Orabug: 25038731

Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: pt: Add support for single thread modified page tables

Modifications to the page tables in sif_pt is protected by
a lock to allow multiple threads to add and subtract regions
to/from the page table in parallel. This functionality is
currently only needed/used by the special sq_cmpl page table
handling. In the future we might however need this also for
other cases, for instance to optimize further on page table
memory usage.

The kernel documentation for infiniband midlayer locking
requires that map_phys_fmr should be callable from any context.
This prevents us from blocking on a lock, something that happens
if there are contention for the lock (eg. more than one thread
involved in modifying the page table)

Implement another flag: thread_safe in a pt that determines
if a page table is going to need to be modified from multiple
threads simultaneously. For now keep a BUG_ON if the code
is attempted accessed in parallel for memory types
that should not ever see parallel access.

Orabug: 24836269

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Francisco Trivino-Garcia <francisco.trivino@oracle.com>

sif: pqp: Implement handling of PQPs in error.

The assumption is that any such situation that can arise in
production is due to an application that causes it's CQ to go
to error and where the PQP subsequently tries to post a CQ
operation that affects the CQ that is in error. In these cases,
the PQP itself goes to error and an event is generated.

This commit refactors the modify_qp logic slightly, as well as
implementing a modification cycle to bring a privileged QP
back up again. It also adds a new pqp debugfs file and some statistics
to help monitoring the new PQP specific state as well.

The resurrect operation is queued on the sif workqueue
by the new handle_pqp_event function, which is now properly
wired up to accept all PQP events. When a PQP is detected as
being in error, its last_set_state is updated, and in addition
the write_only flag is set, which causes new send reqs not to
touch any collect buffer as part of the operation.
This flag was introduced to allow the resurrect to set the PQP in
RTS again while still not triggering any sends.

This way the implementation allows clients to continue to
post requests to the PQP while it is in error or in transition
back to RTS again by just accepting these requests into the PQP
send queue without any writes to the collect buffer.

When in the INIT state, the resurrect worker updates
the SQ pointers to skip the request that triggered
the PQP error.

Once back in RTS, the resurrect worker can take the single SQ lock
which serializes posts, check the size of the send queue
and if >= 0, trigger the send queue scheduler to start processing these.
Once the QP is in SQS mode, or just idle if the queue was empty,
it is safe for ordering purposes to let normal posting with
collect buffer writes commence.

Orabug: 24715634

Signed-off-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Francisco Trivino-Garcia <francisco.trivino@oracle.com>

sif: eps*: initialize each struct in array

explisitly zero all the structs, set and retain also for epsa0
which was (unintentionally) zeroed also when looping through rest
of the eps entries.

Make sure to set members of the struct after initialization.

Orabug: 25033224

Signed-off-by: George Refseth <george.refseth@oracle.com>
Reviewed-by: Åsmund Østvold <asmund.ostvold@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: query_device: Return correct #SGEs for EoIB

ULPs that use LSO, can only create QPs with #SGE entries being one
less than what is supported in HW. This because PSIF uses one entry
for the LSO stencil. The driver attempts to detect the ULP and from
that derive if LSO will be used.

This commit adds support for XVE (Xsigo Virtual Ethernet), and a
query_device() from the XVE driver (not the connected mode part), will
return one less #SGEs.

Further, since the XVE ULP is detected, we amend the sysfs listing of
QPs with EoIB (for XVE datagram mode QPs) and EoIB_CM (for XVE
connected mode QPs).

Orabug: 25027106

Signed-off-by: Hakon Bugge <Haakon.Bugge@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

sif: LSO not supported for EoIB queuepairs

Due to missing decoding of create_flags, LSO is not enabled for EoIB
queuepairs. Regression was introduced in 4.1.12-71 kernel

Orabug: 25026132

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

sif: pqp: Make setup/teardown function ref sif_pqp_info directly

Take advantage of the new, cleaner and more separate PQP data structures:
Simplify/abstract pqp setup/teardown by pointing into sdev
for a direct ref to the sif_pqp_info data structure.

Orabug: 24715634

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: Move the rest of the pqp setup and teardown to sif_pqp

This finishes the restructure started in the previous commit,
to consolidate PQP handling logic to make it simpler
to extend without spreading the complexity to other parts
of the code.

Orabug: 24715634

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: Move sif_dfs_register beyond base init

The debugfs setup must be initialized prior to PQP operation,
but must also be deinitialized before base table takedown,
otherwise we are exposed to faults due to a race condition between
a user accessing debugfs tables and driver unload.

This commit moves the dfs init/deinit from sif_probe to
sif_hw_init to achieve this order.

Orabug: 24971465

Signed-off-by: Knut Omang <knut.omang@oracle.com>

sif: Refactor PQP state out of sif_dev.

Prepare handling of PQPs for the additional complexity
of resurrect upon errors and query for undetected
error conditions by moving some of the state
now directly in sif_dev into a new sif_pqp_info
struct.

Already complex logic for PQP handling is going to
increase in complexity. We need to consolidate
to be able to keep clean and easily maintainable
interfaces.

Orabug: 24715634

Signed-off-by: Knut Omang <knut.omang@oracle.com>

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks:
ecryptfs: don't allow mmap when the lower fs doesn't support it
Revert "ecryptfs: forbid opening files without mmap handler"

ecryptfs: don't allow mmap when the lower fs doesn't support it

There are legitimate reasons to disallow mmap on certain files, notably
in sysfs or procfs. We shouldn't emulate mmap support on file systems
that don't offer support natively.

CVE-2016-1583

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Cc: stable@vger.kernel.org
[tyhicks: clean up f_op check by using ecryptfs_file_to_lower()]
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Mainline v4.7 commit f0fe970df3838c202ef6c07a4c2b36838ef0a88b
Replaces UEK4 commit e06914f2e9ac6b3f19d4461cb24b401f77ce4f17
which was reverted by UEK4 commit b1660e855b21.

Orabug: 24971905
CVE: CVE-2016-1583
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com

Revert "ecryptfs: forbid opening files without mmap handler"

This reverts commit UEK4 e06914f2e9ac6b3f19d4461cb24b401f77ce4f17
which was based on mainline v4.7 commit:
2f36db71009304b3f0b95afacd8eba1f9f046b87
ecryptfs: forbid opening files without mmap handler

2f36db71 was replaced by mainline v4.7 commit:
f0fe970df3838c202ef6c07a4c2b36838ef0a88b
ecryptfs: don't allow mmap when the lower fs doesn't support it
which follows in another commit.

Also see mainline 4.7 commit 78c4e172412de5d0456dc00d2b34050aa0b683b5
Revert "ecryptfs: forbid opening files without mmap handler"

Orabug: 24971905
CVE: CVE-2016-1583
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks:
  percpu: fix synchronization between synchronous map extension and chunk destruction
  percpu: fix synchronization between chunk->map_extend_work and chunk destruction
  ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt
  ALSA: timer: Fix leak in events via snd_timer_user_ccallback
  ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS

percpu: fix synchronization between synchronous map extension and chunk destruction

For non-atomic allocations, pcpu_alloc() can try to extend the area
map synchronously after dropping pcpu_lock; however, the extension
wasn't synchronized against chunk destruction and the chunk might get
freed while extension is in progress.

This patch fixes the bug by putting most of non-atomic allocations
under pcpu_alloc_mutex to synchronize against pcpu_balance_work which
is responsible for async chunk management including destruction.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # v3.18+
Fixes: 1a4d76076cda ("percpu: implement asynchronous chunk population")
Orabug: 25060076
CVE: CVE-2016-4794
Mainline v4.7 commit 6710e594f71ccaad8101bc64321152af7cd9ea28
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>

percpu: fix synchronization between chunk->map_extend_work and chunk destruction

Atomic allocations can trigger async map extensions which is serviced
by chunk->map_extend_work. pcpu_balance_work which is responsible for
destroying idle chunks wasn't synchronizing properly against
chunk->map_extend_work and may end up freeing the chunk while the work
item is still in flight.

This patch fixes the bug by rolling async map extension operations
into pcpu_balance_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # v3.18+
Fixes: 9c824b6a172c ("percpu: make sure chunk->map array has available space")
Orabug: 25060076
CVE: CVE-2016-4794
Mainline v4.7 commit 4f996e234dad488e5d9ba0858bc1bae12eff82c3
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>

ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt

The stack object “r1” has a total size of 32 bytes. Its field
“event” and “val” both contain 4 bytes padding. These 8 bytes
padding bytes are sent to user without being initialized.

Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 25059885
CVE: CVE-2016-4578
Mainline v4.7 commit e4ec8cc8039a7063e24204299b462bd1383184a5
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>

ALSA: timer: Fix leak in events via snd_timer_user_ccallback

The stack object “r1” has a total size of 32 bytes. Its field
“event” and “val” both contain 4 bytes padding. These 8 bytes
padding bytes are sent to user without being initialized.

Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 25059885
CVE: CVE-2016-4578
Mainline v4.7 commit 9a47e9cff994f37f7f0dbd9ae23740d0f64f9fe6
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>

ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS

The stack object “tread” has a total size of 32 bytes. Its field
“event” and “val” both contain 4 bytes padding. These 8 bytes
padding bytes are sent to user without being initialized.

Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 25059408
CVE: CVE-2016-4569
Mainline v4.7 commit cec8f96e49d9be372fdb0c3836dcf31ec71e457e
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>

Merge branch topic/uek-4.1/rpm-build of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/rpm-build:
uek-rpm ol7: change uek-rpm/ol7/update-el release value from 7.1 to 7.3

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks:
perf tools: handle spaces in file names obtained from /proc/pid/maps

perf tools: handle spaces in file names obtained from /proc/pid/maps

Steam frequently puts game binaries in folders with spaces.

Note: "(deleted)" markers are now treated as part of the file name.

Signed-off-by: Marcin Ślusarz <marcin.slusarz@gmail.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Fixes: 6064803313ba ("perf tools: Use sscanf for parsing /proc/pid/maps")
Link: http://lkml.kernel.org/r/20160119190303.GA17579@marcin-Inspiron-7720
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
(cherry picked from commit 89fee59b504f86925894fcc9ba79d5c933842f93)

Orabug: 25072114
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>

uek-rpm ol7: change uek-rpm/ol7/update-el release value from 7.1 to 7.3

Change release value in uek-rpm/ol7/update-el to 7.3 so that manual builds
will pick up the new OL7.3 secure boot key.
uek-rpm/ol6/update-el is not affected.

Orabug: 25050588

Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Reviewed-by: Guru Anbalagane <guru.anbalagane@oracle.com>

Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/ofed:
  xsigo: send nack codes
  xsigo: xve driver has excessive messages
  xsigo: hard LOCKUP in freeing paths
  xsigo: Crash in xscore_port_num
  xsigo: Resize uVNIC/PVI CQ size
  xsigo: Optimizing Transmit completions
  xsigo: Implementing Jumbo MTU support
  RDS: rds debug messages are enabled by default
  net/rds: Fix new sparse warning
  net/rds: fix unaligned memory access

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks: (23 commits)
  NFS: Fix an LOCK/OPEN race when unlinking an open file
  intel_idle: correct BXT support
  intel_idle: re-work bxt_idle_state_table_update() and its helper
  x86/intel_idle: Use Intel family macros for intel_idle
  x86/cpu/intel: Introduce macros for Intel family numbers
  intel_idle: add BXT support
  intel_idle: Add KBL support
  intel_idle: Add SKX support
  intel_idle: Clean up all registered devices on exit.
  intel_idle: Propagate hot plug errors.
  intel_idle: Don't overreact to a cpuidle registration failure.
  intel_idle: Setup the timer broadcast only on successful driver load.
  intel_idle: Avoid a double free of the per-CPU data.
  intel_idle: Fix dangling registration on error path.
  intel_idle: Fix deallocation order on the driver exit path.
  intel_idle: Remove redundant initialization calls.
  intel_idle: Fix a helper function's return value.
  intel_idle: remove useless return from void function.
  intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
  intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
  ...

xsigo: send nack codes

Orabug: 24442792

Sometime uVNIC removal on OFOS won't trigger a actual removal
of Vstar interface, in that case uVNIC driver has to send NACK
code so that XCM will start cleaning its database.

Added additional codes as per XCM specification

Reported-by: jie zhu <jie.x.zhu@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: Qingjun Wang <qingjun.wang@oracle.com>
Reviewed-by: Manish Kumar Singh <mk.singh@oracle.com>
Reviewed-by: UmaShankar Tumari Mahabalagiri <umashankar.mahabalagiri@oracle.com>

xsigo: xve driver has excessive messages

Orabug: 24758335

Moved some message types from Warning to debug.

Consolidated multiple messages into single to avoid
flooding of messages on console

Added more counters to identify state of vnic.

Added a debug type xve_info

Reported-by: chien yen <chien.yen@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: Aravind Kini <aravind.kini@oracle.com>
Reviewed-by: UmaShankar Tumari Mahabalagiri <umashankar.mahabalagiri@oracle.com>

xsigo: hard LOCKUP in freeing paths

Orabug: 24669507

When path->users becomes zero uVNIC driver starts
cleaning up the Forwarding table entries.

In some corner cases the call is invoked from transmit
function which is in interrupt context and that results
in a hard LOCKUP.

With new changes path->users is decremented in transmit
function to allow cleanup to happen from other thread.
Proper care is taken to avoid race between these
two contexts.

Reported-by: chien yen <chien.yen@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: Aravind Kini <aravind.kini@oracle.com>
Reviewed-by: viswa krishnamurthy <viswa.krishnamurthy@oracle.com>
Reviewed-by: Manish Kumar Singh <mk.singh@oracle.com>
Reviewed-by: UmaShankar Tumari Mahabalagiri <umashankar.mahabalagiri@oracle.com>

xsigo: Crash in xscore_port_num

Orabug: 24760465

When Server Profile context is not present
xcpm_get_xsmp_session_info returns error and uVNIC
driver has to handle that conditions

Reported-by: scarlett chen <scarlett.chen@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: viswa krishnamurthy <viswa.krishnamurthy@oracle.com>
Reviewed-by: UmaShankar Tumari Mahabalagiri <umashankar.mahabalagiri@oracle.com>

xsigo: Resize uVNIC/PVI CQ size

Orabug: 24765034

uVNIC/PVI should avoid CQ overflow condition

Resize CQ's to 16k to handle multiple connections
flushed simultaneously per path.

Increase Send Queue and receive Queue to 2k for
better performance.

Added counters to print CQ sizes.
Added stats to count RC completions.

Reported-by: scarlett chen <scarlett.chen@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: Aravind Kini <aravind.kini@oracle.com>
Reviewed-by: viswa krishnamurthy <viswa.krishnamurthy@oracle.com>
Reviewed-by: Manish Kumar Singh <mk.singh@oracle.com>
Reviewed-by: UmaShankar Tumari Mahabalagiri <umashankar.mahabalagiri@oracle.com>

xsigo: Optimizing Transmit completions

Orabug: 24928865

Added a timer for polling Transmit completion and
removed polling completion from a thread context.

Seeing Good Performance improvments with the changes.
In some cases uVNIC is seeing 10% increase in throughput

Reported-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: sajid zia <szia@oracle.com>

xsigo: Implementing Jumbo MTU support

Orabug: 24928804

With Titan and Saturn supporting Jumbo Infiniband frames
uVNIC can have MTU greater than 4k and upto 10k.

Allocate multiple pages for Receive descriptors code changes
for handling multiple page mapping and unmapping.

Took proper care for enabling Jumbo MTU only for Titan and only
in EoiB mode.

If Jumbo MTU is used for non-Titan cards uVNIC driver will NACK
the Install and OFOS will display a failure message for the
install.

Added stats to display Jumbo & removed legacy EoiB HeartBeat code.

Reported-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Reviewed-by: sajid zia <szia@oracle.com>

NFS: Fix an LOCK/OPEN race when unlinking an open file

Orabug: 24476280

At Connectathon 2016, we found that recent upstream Linux clients
would occasionally send a LOCK operation with a zero stateid. This
appeared to happen in close proximity to another thread returning
a delegation before unlinking the same file while it remained open.

Earlier, the client received a write delegation on this file and
returned the open stateid. Now, as it is getting ready to unlink the
file, it returns the write delegation. But there is still an open
file descriptor on that file, so the client must OPEN the file
again before it returns the delegation.

Since commit 24311f884189 ('NFSv4: Recovery of recalled read
delegations is broken'), nfs_open_delegation_recall() clears the
NFS_DELEGATED_STATE flag _before_ it sends the OPEN. This allows a
racing LOCK on the same inode to be put on the wire before the OPEN
operation has returned a valid open stateid.

To eliminate this race, serialize delegation return with the
acquisition of a file lock on the same file. Adopt the same approach
as is used in the unlock path.

This patch also eliminates a similar race seen when sending a LOCK
operation at the same time as returning a delegation on the same file.

Fixes: 24311f884189 ('NFSv4: Recovery of recalled read ... ')
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[Anna: Add sentence about LOCK / delegation race]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
(cherry picked from commit 11476e9dec39d90fe1e9bf12abc6f3efe35a073d)
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>

intel_idle: correct BXT support

Orabug: 24810432

Commit 5dcef69486 ("intel_idle: add BXT support") added an 8-element
lookup array with just a 2-bit value used for lookups. As per the SDM
that bit field is really 3 bits wide. While this is supposedly benign
here, future re-use of the code for other CPUs might expose the issue.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit bef450962597ff39a7f9d53a30523aae9eb55843)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

intel_idle: re-work bxt_idle_state_table_update() and its helper

Orabug: 24810432

Since irtl_ns_units[] has itself zero entries, make sure the caller
recognized those cases along with the MSR read returning zero, as zero
is not a valid value for exit_latency and target_residency.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 3451ab3ebf92b12801878d8b5c94845afd4219f0)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/intel_idle: Use Intel family macros for intel_idle

Orabug: 24810432

Use the new INTEL_FAM6_* macros for intel_idle.c.  Also fix up
some of the macros to be consistent with how some of the
intel_idle code refers to the model.

There's on oddity here: model 0x1F is uniquely referred to here
and nowhere else that I could find.  0x1E/0x1F are just spelled
out as "Intel Core i7 and i5 Processors" in the SDM or as "Intel
processors based on the Nehalem, Westmere microarchitectures" in
the RDPMC section.  Comments between tables 19-19 and 19-20 in
the SDM seem to point to 0x1F being some kind of Westmere, so
let's call it "WESTMERE2".

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jacob.jun.pan@intel.com
Cc: linux-pm@vger.kernel.org
Link: http://lkml.kernel.org/r/20160603001932.EE978EB9@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit db73c5a8c80decbb6ddf208e58f3865b4df5384d)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpu/intel: Introduce macros for Intel family numbers

Orabug: 24810432

Problem:

We have a boatload of open-coded family-6 model numbers.  Half of
them have these model numbers in hex and the other half in
decimal.  This makes grepping for them tons of fun, if you were
to try.

Solution:

Consolidate all the magic numbers.  Put all the definitions in
one header.

The names here are closely derived from the comments describing
the models from arch/x86/events/intel/core.c.  We could easily
make them shorter by doing things like s/SANDYBRIDGE/SNB/, but
they seemed fine even with the longer versions to me.

Do not take any of these names too literally, like "DESKTOP"
or "MOBILE".  These are all colloquial names and not precise
descriptions of everywhere a given model will show up.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com>
Cc: Souvik Kumar Chakravarty <souvik.k.chakravarty@intel.com>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Vishwanath Somayaji <vishwanath.somayaji@intel.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: jacob.jun.pan@intel.com
Cc: linux-acpi@vger.kernel.org
Cc: linux-edac@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: platform-driver-x86@vger.kernel.org
Link: http://lkml.kernel.org/r/20160603001927.F2A7D828@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 970442c599b22ccd644ebfe94d1d303bf6f87c05)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

intel_idle: add BXT support

Orabug: 24810432

Broxton has all the HSW C-states, except C3.
BXT C-state timing is slightly different.

Here we trust the IRTL MSRs as authority
on maximum C-state latency, and override the driver's tables
with the values found in the associated IRTL MSRs.
Further we set the target_residency to 1x maximum latency,
trusting the hardware demotion logic.

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 5dcef694860100fd16885f052591b1268b764d21)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/msr-index.h

intel_idle: Add KBL support

Orabug: 24810432

KBL is similar to SKL

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 3ce093d4de753d6c92cc09366e29d0618a62f542)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

intel_idle: Add SKX support

Orabug: 24810432

SKX is similar to BDX

Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit f9e71657c2c0a8f1c50884ab45794be2854e158e)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

intel_idle: Clean up all registered devices on exit.

Orabug: 24810432

This driver registers cpuidle devices when a CPU comes online, but it
leaves the registrations in place when a CPU goes offline. The module
exit code only unregisters the currently online CPUs, leaving the
devices for offline CPUs dangling.

This patch changes the driver to clean up all registrations on exit,
even those from CPUs that are offline.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 3e66a9ab53641a0f7a440e56f7b35bf5d77494b3)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

intel_idle: Propagate hot plug errors.

Orabug: 24810432

If a cpuidle registration error occurs during the hot plug notifier
callback, we should really inform the hot plug machinery instead of
just ignoring the error. This patch changes the callback to properly
return on error.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 08820546e4c30c84d0a1f1a49df055e1719c07ea)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

intel_idle: Don't overreact to a cpuidle registration failure.

Orabug: 24810432

The helper function, intel_idle_cpu_init, registers one new device
with the cpuidle layer. If the registration should fail, that
function immediately calls intel_idle_cpuidle_devices_uninit() to
unregister every last CPU's device. However, it makes no sense to do
so, when called from the hot plug notifier callback.

This patch moves the call to intel_idle_cpuidle_devices_uninit()
outside of the helper function to the one call site that actually
needs to perform the de-registrations.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit b69ef2c099c3e5f11bd5c33a9530d6522f72c9aa)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

intel_idle: Setup the timer broadcast only on successful driver load.

Orabug: 24810432

This driver sets the broadcast tick quite early on during probe and does
not clean up again in cast of failure. This patch moves the setup call
after the registration, placing the on_each_cpu() calls within the global
CPU lock region.

Signed-off-by: Richard Cochran <rcochran@linutronix.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 2259a819a8d37e472f08c88bc0dd22194754adb4)
Signed-off-by: Brian Maly <brian.maly@oracle.com>