www.infradead.org Git - users/jedix/linux-maple.git/log

net: ipv4: Consider failed nexthops in multipath routes

Multipath route lookups should consider knowledge about next hops and not
select a hop that is known to be failed.

Example:

                     [h2]                   [h3]   15.0.0.5
                      |                      |
                     3|                     3|
                    [SP1]                  [SP2]--+
                     1  2                   1     2
                     |  |     /-------------+     |
                     |   \   /                    |
                     |     X                      |
                     |    / \                     |
                     |   /   \---------------\    |
                     1  2                     1   2
         12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                     4                         4
                      \                       /
                        \                    /
                         \                  /
                          -------|   |-----/
                                 1   2
                                [TOR3]
                                  3|
                                   |
                                  [h1]  12.0.0.1

host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:

    root@h1:~# ip ro ls
    ...
    12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
    15.0.0.0/16
            nexthop via 12.0.0.2  dev swp1 weight 1
            nexthop via 12.0.0.3  dev swp1 weight 1
    ...

If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:

    root@h1:~# ip neigh show
    ...
    12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
    12.0.0.2 dev swp1  FAILED
    ...

The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.

To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.

Orabug: 27547114

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a6db4494d218c2e559173661ee972e048dc04fdd)

Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/ipv4/fib_semantics.c
net/ipv4/sysctl_net_ipv4.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

ipv4: L3 hash-based multipath

Replaces the per-packet multipath with a hash-based multipath using
source and destination address.

Orabug: 27547114

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0e884c78ee19e902f300ed147083c28a0c6302f0)

Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
include/net/ip_fib.h
net/ipv4/fib_semantics.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

dm: fix race between dm_get_from_kobject() and __dm_destroy()

Orabug: 27677556
CVE: CVE-2017-18203

The following BUG_ON was hit when testing repeat creation and removal of
DM devices:

    kernel BUG at drivers/md/dm.c:2919!
    CPU: 7 PID: 750 Comm: systemd-udevd Not tainted 4.1.44
    Call Trace:
     [<ffffffff81649e8b>] dm_get_from_kobject+0x34/0x3a
     [<ffffffff81650ef1>] dm_attr_show+0x2b/0x5e
     [<ffffffff817b46d1>] ? mutex_lock+0x26/0x44
     [<ffffffff811df7f5>] sysfs_kf_seq_show+0x83/0xcf
     [<ffffffff811de257>] kernfs_seq_show+0x23/0x25
     [<ffffffff81199118>] seq_read+0x16f/0x325
     [<ffffffff811de994>] kernfs_fop_read+0x3a/0x13f
     [<ffffffff8117b625>] __vfs_read+0x26/0x9d
     [<ffffffff8130eb59>] ? security_file_permission+0x3c/0x44
     [<ffffffff8117bdb8>] ? rw_verify_area+0x83/0xd9
     [<ffffffff8117be9d>] vfs_read+0x8f/0xcf
     [<ffffffff81193e34>] ? __fdget_pos+0x12/0x41
     [<ffffffff8117c686>] SyS_read+0x4b/0x76
     [<ffffffff817b606e>] system_call_fastpath+0x12/0x71

The bug can be easily triggered, if an extra delay (e.g. 10ms) is added
between the test of DMF_FREEING & DMF_DELETING and dm_get() in
dm_get_from_kobject().

To fix it, we need to ensure the test of DMF_FREEING & DMF_DELETING and
dm_get() are done in an atomic way, so _minor_lock is used.

The other callers of dm_get() have also been checked to be OK: some
callers invoke dm_get() under _minor_lock, some callers invoke it under
_hash_lock, and dm_start_request() invoke it after increasing
md->open_count.

Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
(cherry picked from commit b9a41d21dceadf8104812626ef85dc56ee8a60ed)

Reviewd-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

NFS: only invalidate dentrys that are clearly invalid.

Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate")
in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
to be invalidated even if it has filesystems mounted on or it or on a
descendant.  The mounted filesystem is unmounted.

This means we need to be careful not to return 0 unless the directory
referred to truly is invalid.  So -ESTALE or -ENOENT should invalidate
the directory.  Other errors such a -EPERM or -ERESTARTSYS should be
returned from ->d_revalidate() so they are propagated to the caller.

A particular problem can be demonstrated by:

1/ mount an NFS filesystem using NFSv3 on /mnt
2/ mount any other filesystem on /mnt/foo
3/ ls /mnt/foo
4/ turn off network, or otherwise make the server unable to respond
5/ ls /mnt/foo &
6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
7/ kill -9 $! # this results in -ERESTARTSYS being returned
8/ observe that /mnt/foo has been unmounted.

This patch changes nfs_lookup_revalidate() to only treat
  -ESTALE from nfs_lookup_verify_inode() and
  -ESTALE or -ENOENT from ->lookup()
as indicating an invalid inode.  Other errors are returned.

Also nfs_check_inode_attributes() is changed to return -ESTALE rather
than -EIO.  This is consistent with the error returned in similar
circumstances from nfs_update_inode().

As this bug allows any user to unmount a filesystem mounted on an NFS
filesystem, this fix is suitable for stable kernels.

Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
Cc: stable@vger.kernel.org (v3.18+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 27870824

(cherry picked from commit cc89684c9a265828ce061037f1f79f4a68ccd3f7)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: Improve handling of failures on link and route dumps

In general, rtnetlink dumps do not anticipate failure to dump a single
object (e.g., link or route) on a single pass. As both route and link
objects have grown via more attributes, that is no longer a given.

netlink dumps can handle a failure if the dump function returns an
error; specifically, netlink_dump adds the return code to the response
if it is <= 0 so userspace is notified of the failure. The missing
piece is the rtnetlink dump functions returning the error.

Fix route and link dump functions to return the errors if no object is
added to an skb (detected by skb->len != 0). IPv6 route dumps
(rt6_dump_route) already return the error; this patch updates IPv4 and
link dumps. Other dump functions may need to be ajusted as well.

Reported-by: Jan Moskyto Matejka <mq@ucw.cz>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 27959177
(cherry picked from commit f6c5775ff0bfa62b072face6bf1d40f659f194b2)
Add missing err variable for the cherry-pick in rtnetlink.c
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Dan Duval <dan.duval@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/mempolicy: fix use after free when calling get_mempolicy

I hit a use after free issue when executing trinity and repoduced it
with KASAN enabled.  The related call trace is as follows.

  BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
  Read of size 2 by task syz-executor1/798

  INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
     __slab_alloc+0x768/0x970
     kmem_cache_alloc+0x2e7/0x450
     mpol_new.part.2+0x74/0x160
     mpol_new+0x66/0x80
     SyS_mbind+0x267/0x9f0
     system_call_fastpath+0x16/0x1b
  INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
     __slab_free+0x495/0x8e0
     kmem_cache_free+0x2f3/0x4c0
     __mpol_put+0x2b/0x40
     SyS_mbind+0x383/0x9f0
     system_call_fastpath+0x16/0x1b
  INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
  INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600

  Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
  Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
  Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5                          kkkkkkk.
  Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb                          ........
  Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
  Memory state around the buggy address:
  ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc

!shared memory policy is not protected against parallel removal by other
thread which is normally protected by the mmap_sem.  do_get_mempolicy,
however, drops the lock midway while we can still access it later.

Early premature up_read is a historical artifact from times when
put_user was called in this path see https://lwn.net/Articles/124754/
but that is gone since 8bccd85ffbaf ("[PATCH] Implement sys_* do_*
layering in the memory policy layer.").  but when we have the the
current mempolicy ref count model.  The issue was introduced
accordingly.

Fix the issue by removing the premature release.

Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org> [2.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 27963519
CVE: CVE-2018-10675
(cherry picked from commit 73223e4e2e3867ebf033a5a8eb2e5df0158ccc99)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

drm: udl: Properly check framebuffer mmap offsets

Orabug: 27963530
CVE-2018-8781

The memmap options sent to the udl framebuffer driver were not being
checked for all sets of possible crazy values. Fix this up by properly
bounding the allowed values.

Reported-by: Eyal Itkin <eyalit@checkpoint.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20180321154553.GA18454@kroah.com
(cherry picked from commit 3b82a4db8eaccce735dffd50b4d4e1578099b8e8)

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xfs: set format back to extents if xfs_bmap_extents_to_btree

If xfs_bmap_extents_to_btree fails in a mode where we call
xfs_iroot_realloc(-1) to de-allocate the root, set the
format back to extents.

Otherwise we can assume we can dereference ifp->if_broot
based on the XFS_DINODE_FMT_BTREE format, and crash.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199423
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
(cherry picked from commit 2c4306f719b083d17df2963bc761777576b8ad1b)

Orabug: 27963576
CVE: CVE-2018-10323
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/xfs/libxfs/xfs_bmap.c
Drop a part of the patch which is inapplicable to the current kernel.
The WARN_ON_ONCE in the dropped part is introduced in the upstream
commit 2fcc319d2467 as a debugging facility since v4.11 kernel.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "mlx4: change the ICM table allocations to lowest needed size"

This reverts commit UEK4-QU7 commit:
e7567cf1d53bddbe233dabea5c632658671dae9e
("mlx4: change the ICM table allocations to lowest needed size")

Orabug: 27980030

Signed-off-by: Håkon Bugge < haakon.bugge@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Signed-off-by: Qing Huang <qing.huang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Bluetooth: Prevent stack info leak from the EFS element.

Orabug: 28030514
CVE: CVE-2017-1000410

In the function l2cap_parse_conf_rsp and in the function
l2cap_parse_conf_req the following variable is declared without
initialization:

struct l2cap_conf_efs efs;

In addition, when parsing input configuration parameters in both of
these functions, the switch case for handling EFS elements may skip the
memcpy call that will write to the efs variable:

...
case L2CAP_CONF_EFS:
if (olen == sizeof(efs))
memcpy(&efs, (void *)val, olen);
...

The olen in the above if is attacker controlled, and regardless of that
if, in both of these functions the efs variable would eventually be
added to the outgoing configuration request that is being built:

l2cap_add_conf_opt(&ptr, L2CAP_CONF_EFS, sizeof(efs), (unsigned long) &efs);

So by sending a configuration request, or response, that contains an
L2CAP_CONF_EFS element, but with an element length that is not
sizeof(efs) - the memcpy to the uninitialized efs variable can be
avoided, and the uninitialized variable would be returned to the
attacker (16 bytes).

This issue has been assigned CVE-2017-1000410

Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Gustavo Padovan <gustavo@padovan.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Ben Seri <ben@armis.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 06e7e776ca4d36547e503279aeff996cbb292c16)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

netfilter: nfnetlink_cthelper: Add missing permission checks

The capability check in nfnetlink_rcv() verifies that the caller
has CAP_NET_ADMIN in the namespace that "owns" the netlink socket.
However, nfnl_cthelper_list is shared by all net namespaces on the
system.  An unprivileged user can create user and net namespaces
in which he holds CAP_NET_ADMIN to bypass the netlink_net_capable()
check:

    $ nfct helper list
    nfct v1.4.4: netlink error: Operation not permitted
    $ vpnns -- nfct helper list
    {
            .name = ftp,
            .queuenum = 0,
            .l3protonum = 2,
            .l4protonum = 6,
            .priv_data_len = 24,
            .status = enabled,
    };

Add capable() checks in nfnetlink_cthelper, as this is cleaner than
trying to generalize the solution.

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 4b380c42f7d00a395feede754f0bc2292eebe6e5)

Orabug: 27260771
CVE: CVE-2017-17448

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

netlink: Add netns check on taps

Currently, a nlmon link inside a child namespace can observe systemwide
netlink activity.  Filter the traffic so that nlmon can only sniff
netlink messages from its own netns.

Test case:

    vpnns -- bash -c "ip link add nlmon0 type nlmon; \
                      ip link set nlmon0 up; \
                      tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
    sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
        spi 0x1 mode transport \
        auth sha1 0x6162633132330000000000000000000000000000 \
        enc aes 0x00000000000000000000000000000000
    grep --binary abc123 /tmp/nlmon.pcap

Signed-off-by: Kevin Cernekee <cernekee@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 93c647643b48f0131f02e45da3bd367d80443291)

Orabug: 27260799
CVE: CVE-2017-17449

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: Fix stack-out-of-bounds read in write_mmio

commit e39d200fa5bf5b94a0948db0dae44c1b73b84a56 upstream.

Reported by syzkaller:

  BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
  Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

  CPU: 6 PID: 32298 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #18
  Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
  Call Trace:
   dump_stack+0xab/0xe1
   print_address_description+0x6b/0x290
   kasan_report+0x28a/0x370
   write_mmio+0x11e/0x270 [kvm]
   emulator_read_write_onepage+0x311/0x600 [kvm]
   emulator_read_write+0xef/0x240 [kvm]
   emulator_fix_hypercall+0x105/0x150 [kvm]
   em_hypercall+0x2b/0x80 [kvm]
   x86_emulate_insn+0x2b1/0x1640 [kvm]
   x86_emulate_instruction+0x39a/0xb90 [kvm]
   handle_exception+0x1b4/0x4d0 [kvm_intel]
   vcpu_enter_guest+0x15a0/0x2640 [kvm]
   kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
   kvm_vcpu_ioctl+0x479/0x880 [kvm]
   do_vfs_ioctl+0x142/0x9a0
   SyS_ioctl+0x74/0x80
   entry_SYSCALL_64_fastpath+0x23/0x9a

The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
to the guest memory, however, write_mmio tracepoint always prints 8 bytes
through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
leaks 5 bytes from the kernel stack (CVE-2017-17741).  This patch fixes
it by just accessing the bytes which we operate on.

Before patch:

syz-executor-5567  [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

After patch:

syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry-picked from commit 653c41ac4729261cb356ee1aff0f3f4f342be1eb)

Orabug: 27290606
CVE: CVE-2017-17741

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xprtrdma: Detect unreachable NFS/RDMA servers more reliably

Current NFS clients rely on connection loss to determine when to
retransmit. In particular, for protocols like NFSv4, clients no
longer rely on RPC timeouts to drive retransmission: NFSv4 servers
are required to terminate a connection when they need a client to
retransmit pending RPCs.

When a server is no longer reachable, either because it has crashed
or because the network path has broken, the server cannot actively
terminate a connection. Thus NFS clients depend on transport-level
keepalive to determine when a connection must be replaced and
pending RPCs retransmitted.

However, RDMA RC connections do not have a native keepalive
mechanism. If an NFS/RDMA server crashes after a client has sent
RPCs successfully (an RC ACK has been received for all OTW RDMA
requests), there is no way for the client to know the connection is
moribund.

In addition, new RDMA requests are subject to the RPC-over-RDMA
credit limit. If the client has consumed all granted credits with
NFS traffic, it is not allowed to send another RDMA request until
the server replies. Thus it has no way to send a true keepalive when
the workload has already consumed all credits with pending RPCs.

To address this, forcibly disconnect a transport when an RPC times
out. This prevents moribund connections from stopping the
detection of failover or other configuration changes on the server.

Note that even if the connection is still good, retransmitting
any RPC will trigger a disconnect thanks to this logic in
xprt_rdma_send_request:

/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
goto drop_connection;
req->rl_connect_cookie = xprt->connect_cookie;

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 27587008

(cherry picked from commit 33849792cbcdae2b04819cfb09fe3dca0a84a11e)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

sunrpc: Export xprt_force_disconnect()

xprt_force_disconnect() is already invoked from the socket
transport. I want to invoke xprt_force_disconnect() from the
RPC-over-RDMA transport, which is a separate module from sunrpc.ko.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 27587008

(cherry picked from commit e2a4f4fbefc5e5b7b4435f73711b7be94f780584)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

sunrpc: Allow xprt->ops->timer method to sleep

The transport lock is needed to protect the xprt_adjust_cwnd() call
in xs_udp_timer, but it is not necessary for accessing the
rq_reply_bytes_recvd or tk_status fields. It is correct to sublimate
the lock into UDP's xs_udp_timer method, where it is required.

The ->timer method has to take the transport lock if needed, but it
can now sleep safely, or even call back into the RPC scheduler.

This is more a clean-up than a fix, but the "issue" was introduced
by my transport switch patches back in 2005.

Fixes: 46c0ee8bc4ad ("RPC: separate xprt_timer implementations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Orabug: 27587008

(cherry picked from commit b977b644ccf821ab1269582f7efe1d0d85faa1f6)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit

When KVM emulates an exit from L2 to L1, it loads L1 CR4 into the
guest CR4. Before this CR4 loading, the guest CR4 refers to L2
CR4. Because these two CR4's are in different levels of guest, we
should vmx_set_cr4() rather than kvm_set_cr4() here. The latter, which
is used to handle guest writes to its CR4, checks the guest change to
CR4 and may fail if the change is invalid.

The failure may cause trouble. Consider we start
  a L1 guest with non-zero L1 PCID in use,
     (i.e. L1 CR4.PCIDE == 1 && L1 CR3.PCID != 0)
and
  a L2 guest with L2 PCID disabled,
     (i.e. L2 CR4.PCIDE == 0)
and following events may happen:

1. If kvm_set_cr4() is used in load_vmcs12_host_state() to load L1 CR4
   into guest CR4 (in VMCS01) for L2 to L1 exit, it will fail because
   of PCID check. As a result, the guest CR4 recorded in L0 KVM (i.e.
   vcpu->arch.cr4) is left to the value of L2 CR4.

2. Later, if L1 attempts to change its CR4, e.g., clearing VMXE bit,
   kvm_set_cr4() in L0 KVM will think L1 also wants to enable PCID,
   because the wrong L2 CR4 is used by L0 KVM as L1 CR4. As L1
   CR3.PCID != 0, L0 KVM will inject GP to L1 guest.

Fixes: 4704d0befb072 ("KVM: nVMX: Exiting from L2 to L1")
Cc: qemu-stable@nongnu.org
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry-picked from commit 8eb3f87d903168bdbd1222776a6b1e281f50513e)
Orabug: 27720128
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/microcode: probe CPU features on microcode update

Probe for updated CPUID features each time the microcode is
loaded. Specifically this means when the sysfs cpu/microcode
nodes are created (which is when the microcode is first loaded)
or from a user trigger via the sysfs microcode/reload interface.

Orabug: 27878230

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
(cherry-picked from commit c97a2bf2aa93390b23fbf9adb943e494fee18a18)

conflicts:
arch/x86/kernel/cpu/microcode/core.c
arch/x86/include/asm/processor.h

[Backport: call init_scattered_cpuid_features() instead of
get_cpu_cap().]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/microcode: microcode_write() should not reference boot_cpu_data

microcode_write() internally calls the AMD or Intel microcode
update logic, both of which update the cpu_data(cpu)->microcode
value. For probing speculation features however, we call
init_scattered_cpuid_features() with boot_cpu_data which is stale
and might have an old value of microcode version.

Fix this by using cpu_data() instead.

Orabug: 27878230

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry-picked from commit ea35f733dca011782ef2a3647e7f3ac284dbed2e)

Conflict:
arch/x86/kernel/cpu/microcode/core.c

[Backport: modify arch/x86/kernel/microcode_core.c instead.]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpufeatures: use cpu_data in init_scattered_cpuid_flags()

Post SMP init, cpu_data() contains the current cpuinfo state with
the boot_cpu_data potentially going stale.

Switch all boot_cpu_data references to cpu_data() in
init_scattered_cpuid_flags().

Orabug: 27878230

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry-picked from commit 6258d6d6ce2b9cf41830e8616ffc7841b3fddca9)

Conflict:
arch/x86/kernel/cpu/common.c

[Backport: Modify init_scattered_cpuid_flags() instead of
init_speculation_control().]

Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/pagewalk.c: report holes in hugetlb ranges

This matters at least for the mincore syscall, which will otherwise copy
uninitialized memory from the page allocator to userspace. It is
probably also a correctness error for /proc/$pid/pagemap, but I haven't
tested that.

Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
no effect because the caller already checks for that.

This only reports holes in hugetlb ranges to callers who have specified
a hugetlb_entry callback.

This issue was found using an AFL-based fuzzer.

v2:
- don't crash on ->pte_hole==NULL (Andrew Morton)
- add Cc stable (Andrew Morton)

Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
Signed-off-by: Jann Horn <jannh@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 27913118
CVE: CVE-2017-16994

(cherry picked from commit 373c4557d2aa362702c4c2d41288fb1e54990b7c)
Conflict: one line adjust due to huge_pte_offset() interface diff
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KEYS: don't let add_key() update an uninstantiated key

Currently, when passed a key that already exists, add_key() will call the
key's ->update() method if such exists.  But this is heavily broken in the
case where the key is uninstantiated because it doesn't call
__key_instantiate_and_link().  Consequently, it doesn't do most of the
things that are supposed to happen when the key is instantiated, such as
setting the instantiation state, clearing KEY_FLAG_USER_CONSTRUCT and
awakening tasks waiting on it, and incrementing key->user->nikeys.

It also never takes key_construction_mutex, which means that
->instantiate() can run concurrently with ->update() on the same key.  In
the case of the "user" and "logon" key types this causes a memory leak, at
best.  Maybe even worse, the ->update() methods of the "encrypted" and
"trusted" key types actually just dereference a NULL pointer when passed an
uninstantiated key.

Change key_create_or_update() to wait interruptibly for the key to finish
construction before continuing.

This patch only affects *uninstantiated* keys.  For now we still allow a
negatively instantiated key to be updated (thereby positively
instantiating it), although that's broken too (the next patch fixes it)
and I'm not sure that anyone actually uses that functionality either.

Here is a simple reproducer for the bug using the "encrypted" key type
(requires CONFIG_ENCRYPTED_KEYS=y), though as noted above the bug
pertained to more than just the "encrypted" key type:

    #include <stdlib.h>
    #include <unistd.h>
    #include <keyutils.h>

    int main(void)
    {
        int ringid = keyctl_join_session_keyring(NULL);

        if (fork()) {
            for (;;) {
                const char payload[] = "update user:foo 32";

                usleep(rand() % 10000);
                add_key("encrypted", "desc", payload, sizeof(payload), ringid);
                keyctl_clear(ringid);
            }
        } else {
            for (;;)
                request_key("encrypted", "desc", "callout_info", ringid);
        }
    }

It causes:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: encrypted_update+0xb0/0x170
    PGD 7a178067 P4D 7a178067 PUD 77269067 PMD 0
    PREEMPT SMP
    CPU: 0 PID: 340 Comm: reproduce Tainted: G      D         4.14.0-rc1-00025-g428490e38b2e #796
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff8a467a39a340 task.stack: ffffb15c40770000
    RIP: 0010:encrypted_update+0xb0/0x170
    RSP: 0018:ffffb15c40773de8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff8a467a275b00 RCX: 0000000000000000
    RDX: 0000000000000005 RSI: ffff8a467a275b14 RDI: ffffffffb742f303
    RBP: ffffb15c40773e20 R08: 0000000000000000 R09: ffff8a467a275b17
    R10: 0000000000000020 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8a4677057180 R15: ffff8a467a275b0f
    FS:  00007f5d7fb08700(0000) GS:ffff8a467f200000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000018 CR3: 0000000077262005 CR4: 00000000001606f0
    Call Trace:
     key_create_or_update+0x2bc/0x460
     SyS_add_key+0x10c/0x1d0
     entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x7f5d7f211259
    RSP: 002b:00007ffed03904c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000f8
    RAX: ffffffffffffffda RBX: 000000003b2a7955 RCX: 00007f5d7f211259
    RDX: 00000000004009e4 RSI: 00000000004009ff RDI: 0000000000400a04
    RBP: 0000000068db8bad R08: 000000003b2a7955 R09: 0000000000000004
    R10: 000000000000001a R11: 0000000000000246 R12: 0000000000400868
    R13: 00007ffed03905d0 R14: 0000000000000000 R15: 0000000000000000
    Code: 77 28 e8 64 34 1f 00 45 31 c0 31 c9 48 8d 55 c8 48 89 df 48 8d 75 d0 e8 ff f9 ff ff 85 c0 41 89 c4 0f 88 84 00 00 00 4c 8b 7d c8 <49> 8b 75 18 4c 89 ff e8 24 f8 ff ff 85 c0 41 89 c4 78 6d 49 8b
    RIP: encrypted_update+0xb0/0x170 RSP: ffffb15c40773de8
    CR2: 0000000000000018

Cc: <stable@vger.kernel.org> # v2.6.12+
Reported-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Biggers <ebiggers@google.com>

Orabug: 27913330
CVE: CVE-2017-15299

(cherry picked from commit 60ff5b2f547af3828aebafd54daded44cfb0807a)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

drm/vmwgfx: NULL pointer dereference in vmw_surface_define_ioctl()

Before memory allocations vmw_surface_define_ioctl() checks the
upper-bounds of a user-supplied size, but does not check if the
supplied size is 0.

Add check to avoid NULL pointer dereferences.

Cc: <stable@vger.kernel.org>
Signed-off-by: Murray McAllister <murray.mcallister@insomniasec.com>
Reviewed-by: Sinclair Yeh <syeh@vmware.com>
Orabug: 27913367
CVE: CVE-2017-7294

(cherry picked from commit 36274ab8c596f1240c606bb514da329add2a1bcd)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vmscan: Support multiple kswapd threads per node

Page replacement is handled in the Linux Kernel in one of two ways:

1) Asynchronously via kswapd
2) Synchronously, via direct reclaim

At page allocation time the allocating task is immediately given a page
from the zone free list allowing it to go right back to work doing
whatever it was doing; Probably directly or indirectly executing
business logic.

Just prior to satisfying the allocation, free pages is checked to see if
it has reached the zone low watermark and if so, kswapd is awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.

When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min
watermark, the task will no longer pull pages from the free list.
Instead, the task will run the same CPU-bound routines as kswapd to
satisfy its own allocation by scanning and evicting pages. This is
called a direct reclaim.

The time spent performing a direct reclaim can be substantial, often
taking tens to hundreds of milliseconds for small order0 allocations to
half a second or more for order9 huge-page allocations. In fact, kswapd
is not actually required on a linux system. It exists for the sole
purpose of optimizing performance by preventing direct reclaims.

When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur
in small tasks that rarely allocate additional memory. Consider the
impact of injecting an additional 100ms of latency when nscd allocates
memory to facilitate caching of a DNS query.

The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the
problem go away. Since then hardware has evolved to bring a new struggle
for kswapd. Storage speeds have increased by orders of magnitude while
CPU clock speeds have actually slowed down. This presents a throughput
problem for a single threaded kswapd that will get worse with each
generation of new hardware.

------------
Test Details
------------

The tests below were designed with the assumption that a kswapd
bottleneck is best demonstrated using filesystem reads. This way, the
inactive list will be full of clean pages, simplifying the analysis and
allowing kswapd to achieve the highest possible steal rate. Maximum
steal rates for kswapd are likely to be the same or lower for any other
mix of page types on the system.

Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz
cores, 756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each
drive has an XFS filesystem mounted separately as /d0 through /d7. NVMe
drives require multiple concurrent streams to show their potential, so I
created 11 250GB zero-filled files on each drive so that I could test
with parallel reads.

The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.

During each stage dd tasks are launched to read from each drive in a
round robin fashion until the specified number of tasks for the stage
has been reached. Then iostat, vmstat and top are started in the
background with 10 second intervals. After five minutes, all of the dd
tasks are killed and the iostat, vmstat and top output is parsed in
order to report the following:

CPU consumption
- sy: aggregate kernel mode CPU consumption from vmstat output. The
  value doesn't tend to fluctuate much so I just grab the highest value.
  Each sample is averaged over 10 seconds
- dd_cpu: for all of the dd tasks averaged across the top samples since
  there is a lot of variation.

Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total

This first test performs reads using O_DIRECT in order to show the peak
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks are increased.

The dd command for this test looks like this:

Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M

dd sy dd_cpu throughput
6  1  4.52   14966994.50
10 1  4.94   21503269.37
16 1  4.70   25791251.00
22 1  5.02   26139553.00
28 1  4.85   26242989.00
34 2  4.53   26253264.20
40 2  3.82   26265978.60
46 2  3.39   26256091.80
52 2  3.06   26256913.60
58 2  2.74   26256988.40
64 2  2.50   26256534.20
70 2  2.27   26255088.00
76 2  2.12   26247909.00
80 2  1.99   26251164.80

Throughput peaked with 40 dd tasks at 26265978.60 KB/s. Very little
system CPU was consumed as expected the drives DMA directly into the
user address space when using O_DIRECT.

The remaining tests do not use O_DIRECT. We drop the page cache before
testing and stop the test as soon as kswapd wakes up.

dd sy dd_cpu throughput
6  2  30.34  5245348.50
10 3 32.53 7735288.00
16 5 32.78 11059243.20
22 6 30.77 13371912.80
28 8 31.52 16092092.00
34 10 30.12 18000076.80
40 11 29.34 19368494.40
46 11 26.45 20450313.60
52 13 25.47 21249290.40
58 13 23.75 22008188.80
64 13 21.38 22298248.80
70 15 21.39 22442940.80
76 14 18.82 22876260.80
80 15 19.45 23143716.80

Each read has to pause after the buffer in kernel space is populated
while that data is added to the pagecache and copied into the user
address space. For this reason, more parallel streams are required to
achieve peak throughput. The copy operation consumes substantially more
CPU than direct IO as expected.

The next test measures throughput after kswapd starts running. This is
the same test only we wait for kswapd to wake up before we start
collecting metrics.

The script actually keeps track of a few things that were not mentioned
earlier. It tracks direct reclaims and page scans by watching the
metrics in /proc/vmstat. CPU consumption for kswapd is tracked the same
way it is tracked for dd.

Since the test is 100% reads, you can assume that the page steal rate
for kswapd and direct reclaims is almost identical to the scan rate.

1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr's  pgscan_kswapd pgscan_direct
10 4  31.56  24.86   23.56   7828015.60  0     460668253     0
16 7  36.10  65.94   71.74   11149401.80 0     900894848     0
22 10 37.79  91.94   94.25   14271445.20 179   1149779893    4236610
28 14 46.04  86.71   84.39   14633873.80 14624 829782742     346638611
34 16 41.23  85.76   84.04   16058195.40 16303 927594668     386370834
40 20 47.65  69.78   68.79   15381517.80 28538 566746661     676222823
46 22 45.76  64.39   64.50   15941522.40 32567 510483237     771659039
52 25 47.40  60.56   63.15   15189850.20 34504 422051924     816932307
58 29 48.21  53.44   57.19   15191931.40 38313 330596133     907319630
64 32 49.88  51.08   51.10   15073485.20 41939 233133908     993009680
70 36 50.71  51.32   51.54   15265733.00 43348 209193894     1026357511
76 40 51.78  51.39   54.38   15091290.20 43804 192167798     1037072462
80 44 52.95  48.91   55.95   15009935.60 44218 177718893     1046802606

Look closely at the scan statistics and the CPU consumption numbers and
it should be clear that the bulk of the CPU consumption is occuring in
the context of the dd tasks due to direct reclaims, not kswapd.

Same test, more kswapd tasks:

6 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput  dr's  pgscan_kswapd pgscan_direct
10 4  33.19  6.98    6.44    8184050.97  0     460877355     0
16 10 41.61  27.99   28.26   11533556.80 0     941456735     0
22 12 39.00  28.12   29.67   14303265.20 10    1170356251    237431
28 15 37.53  38.01   40.42   16449001.40 30    1355387318    711292
34 19 38.87  49.81   51.33   18094928.20 0     1495622630    0
40 22 37.62  56.93   59.27   19562580.80 0     1618461307    0
46 25 36.51  64.00   66.34   20800868.60 0     1715162179    0
52 28 36.89  70.68   74.60   21650189.60 0     1787311285    0
58 34 37.44  80.59   81.43   22395273.00 1190  1794721827    28089474
64 44 50.22  67.36   76.96   21848111.20 18150 1105060289    429342513
70 46 55.57  56.59   64.22   18766118.20 27918 724659653     660301547
76 50 42.37  67.79   75.83   23688889.40 18603 1171174674    440088674
80 49 40.14  72.05   79.60   22350506.00 15680 1310470634    370843890

With 58 dd tasks, throughput is roughly the same as what we saw without
memory pressure. Ten additional kswapd tasks (5 per node) resulted in a
17% increase in aggregate kernel mode CPU consumption.

NOTE: The kswapd tasks were originally tracked with an array of task
structs in each pgdata structure. Sadly, any changes to the pg_data_t
resulted in KABI breakage. Look for the following definition that was
used as a workaround:

static struct task_struct *kswapd_list[MAX_NUMNODES][MAX_KSWAPD_THREADS];

Orabug: 27913411

Signed-off-by: Buddy Lumpkin <buddy.lumpkin@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Reviewed by: Henry Willard <henry.willard@oracle.com

Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: don't use F-RTO on non-recurring timeouts

Currently F-RTO may repeatedly send new data packets on non-recurring
timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682)
should only be used on either new recovery or recurring timeouts.

This exacerbates the recovery progress during frequent timeout &
repair, because we prioritize sending new data packets instead of
repairing the holes when the bandwidth is already scarce.

Fix it by correcting the test of a new recovery episode.

Orabug: 27901860

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f82b681a511f4d61069e9586a9cf97bdef371ef3)

Reviewed-by: Hakon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/rds: ib: Release correct number of frags

Commit c682e8474bd4 ("net/rds: reduce memory footprint during
ib_post_recv in IB transport") introduces an SG list instead of a
single contiguously fragment. When rebuilding the caches, it attempts
to release the number of fragments used by the new connection,
independent of the actual number of fragments used by the cache. This
leads to a kernel crash. Instead, release the correct number of
fragments.

Orabug: 27924161

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: rng - Remove old low-level rng interface

Orabug: 27926676
CVE: CVE-2017-15116

Now that all rng implementations have switched over to the new
interface, we can remove the old low-level interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 94f1bb15bed84ad6c893916b7e7b9db6f1d7eec6)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: drbg - Convert to new rng interface

Orabug: 27926676
CVE: CVE-2017-15116

This patch converts the DRBG implementation to the new low-level
rng interface.

This allows us to get rid of struct drbg_gen by using the new RNG
API instead.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Stephan Mueller <smueller@chronox.de>
(cherry picked from commit 8fded5925d0a733c46f8d0b5edd1c9b315882b1d)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
crypto/drbg.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: ansi_cprng - Convert to new rng interface

Orabug: 27926676
CVE: CVE-2017-15116

This patch ocnverts the ANSI CPRNG implementation to the new
low-level rng interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
(cherry picked from commit e7c2422a839bfc6876a2f7a9b283bb2963f0287b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

crypto: krng - Convert to new rng interface

Orabug: 27926676
CVE: CVE-2017-15116

This patch ocnverts the KRNG implementation to the new low-level
rng interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit e33cf2c5aab7d0012e7890089e89ae2466c2449c)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

RDS: Heap OOB write in rds_message_alloc_sgs()

Orabug: 27934066
CVE: CVE-2018-5332

When args->nr_local is 0, nr_pages gets also 0 due some size
calculation via rds_rm_size(), which is later used to allocate
pages for DMA, this bug produces a heap Out-Of-Bound write access
to a specific memory region.

Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c095508770aebf1b9218e77026e48345d719b17c)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: Fix double free and memory corruption in get_net_ns_by_id()

Orabug: 27934789
CVE: CVE-2017-15129

(I can trivially verify that that idr_remove in cleanup_net happens
after the network namespace count has dropped to zero --EWB)

Function get_net_ns_by_id() does not check for net::count
after it has found a peer in netns_ids idr.

It may dereference a peer, after its count has already been
finaly decremented. This leads to double free and memory
corruption:

put_net(peer)                                   rtnl_lock()
atomic_dec_and_test(&peer->count) [count=0]     ...
__put_net(peer)                                 get_net_ns_by_id(net, id)
  spin_lock(&cleanup_list_lock)
  list_add(&net->cleanup_list, &cleanup_list)
  spin_unlock(&cleanup_list_lock)
queue_work()                                      peer = idr_find(&net->netns_ids, id)
  |                                               get_net(peer) [count=1]
  |                                               ...
  |                                               (use after final put)
  v                                               ...
  cleanup_net()                                   ...
    spin_lock(&cleanup_list_lock)                 ...
    list_replace_init(&cleanup_list, ..)          ...
    spin_unlock(&cleanup_list_lock)               ...
    ...                                           ...
    ...                                           put_net(peer)
    ...                                             atomic_dec_and_test(&peer->count) [count=0]
    ...                                               spin_lock(&cleanup_list_lock)
    ...                                               list_add(&net->cleanup_list, &cleanup_list)
    ...                                               spin_unlock(&cleanup_list_lock)
    ...                                             queue_work()
    ...                                           rtnl_unlock()
    rtnl_lock()                                   ...
    for_each_net(tmp) {                           ...
      id = __peernet2id(tmp, peer)                ...
      spin_lock_irq(&tmp->nsid_lock)              ...
      idr_remove(&tmp->netns_ids, id)             ...
      ...                                         ...
      net_drop_ns()                               ...
net_free(peer)                            ...
    }                                             ...
  |
  v
  cleanup_net()
    ...
    (Second free of peer)

Also, put_net() on the right cpu may reorder with left's cpu
list_replace_init(&cleanup_list, ..), and then cleanup_list
will be corrupted.

Since cleanup_net() is executed in worker thread, while
put_net(peer) can happen everywhere, there should be
enough time for concurrent get_net_ns_by_id() to pick
the peer up, and the race does not seem to be unlikely.
The patch fixes the problem in standard way.

(Also, there is possible problem in peernet2id_alloc(), which requires
check for net::count under nsid_lock and maybe_get_net(peer), but
in current stable kernel it's used under rtnl_lock() and it has to be
safe. Openswitch begun to use peernet2id_alloc(), and possibly it should
be fixed too. While this is not in stable kernel yet, so I'll send
a separate message to netdev@ later).

Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Fixes: 0c7aecd4bde4 "netns: add rtnl cmd to add and get peer netns ids"
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 21b5944350052d2583e82dd59b19a9ba94a007f0)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/core/net_namespace.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

vhost/scsi: fix reuse of &vq->iov[out] in response

The address of the iovec &vq->iov[out] is not guaranteed to contain the scsi
command's response iovec throughout the lifetime of the command. Rather, it
is more likely to contain an iovec from an immediately following command
after looping back around to vhost_get_vq_desc(). Pass along the iovec
entirely instead.

Fixes: 79c14141a487 ("vhost/scsi: Convert completion path to use copy_to_iter")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit a77ec83a57890240c546df00ca5df1cdeedb1cc3)

Orabug: 27928330

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Mridula Shastry <mridula.c.shastry@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

kernel.spec: add requires system-release for OL7

Orabug: 27955380

Add "Requires: system-release" to OL7 specfile

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Victor Erminpour <victor.erminpour@oracle.com>

x86/kernel/traps.c: fix trace_die_notifier return value

When triggering a int3 directly, the trace_die_notifier() actually returns 1
(whereas all other notifiers return 0), and that 1 value was being interpreted
as an indicator that DTrace handled the trap and that emulation is needed. The
codei, from that point on, took a branch that is only to be used when the trap
occurs in kernel code, which is not good when it was actually triggered from
userspace.

OraBug: 27895315
CVE: CVE-2018-8897

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/entry/64: Dont use IST entry for #BP stack

There's nothing IST-worthy about #BP/int3. We don't allow kprobes
in the small handful of places in the kernel that run at CPL0 with
an invalid stack, and 32-bit kernels have used normal interrupt
gates for #BP forever.

Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.

OraBug: 27895315
CVE: CVE-2018-8897

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d8ba61ba58c88d5207c1ba2f7d9a2280e7d03be9)

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

kvm/x86: fix icebp instruction handling

The undocumented 'icebp' instruction (aka 'int1') works pretty much like
'int3' in the absense of in-circuit probing equipment (except,
obviously, that it raises #DB instead of raising #BP), and is used by
some validation test-suites as such.

But Andy Lutomirski noticed that his test suite acted differently in kvm
than on bare hardware.

The reason is that kvm used an inexact test for the icebp instruction:
it just assumed that an all-zero VM exit qualification value meant that
the VM exit was due to icebp.

That is not unlike the guess that do_debug() does for the actual
exception handling case, but it's purely a heuristic, not an absolute
rule. do_debug() does it because it wants to ascribe _some_ reasons to
the #DB that happened, and an empty %dr6 value means that 'icebp' is the
most likely casue and we have no better information.

But kvm can just do it right, because unlike the do_debug() case, kvm
actually sees the real reason for the #DB in the VM-exit interruption
information field.

So instead of relying on an inexact heuristic, just use the actual VM
exit information that says "it was 'icebp'".

Right now the 'icebp' instruction isn't technically documented by Intel,
but that will hopefully change. The special "privileged software
exception" information _is_ actually mentioned in the Intel SDM, even
though the cause of it isn't enumerated.

OraBug: 27895356
CVE: CVE-2018-1087

Reported-by: Andy Lutomirski <luto@kernel.org>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 32d43cd391bacb5f0814c2624399a5dad3501d09)

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

perf/hwbp: Simplify the perf-hwbp code, fix documentation

Orabug: 27947602
CVE: CVE-2018-100199

Annoyingly, modify_user_hw_breakpoint() unnecessarily complicates the
modification of a breakpoint - simplify it and remove the pointless
local variables.

Also update the stale Docbook while at it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit f67b15037a7a50c57f72e69a6d59941ad90a0f0f)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: iscsi_tcp: set BDI_CAP_STABLE_WRITES when data digest enabled

iscsi tcp will first send out data, then calculate and send data
digest. If we don't have BDI_CAP_STABLE_WRITES, the page cache will be
written in spite of the on going writeback. Consequently, wrong digest
will be got and sent to target.

To fix this, set BDI_CAP_STABLE_WRITES when data digest is enabled
in iscsi_tcp .slave_configure callback.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Acked-by: Chris Leech <cleech@redhat.com>
Acked-by: Lee Duncan <lduncan@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(backport upstream commit 89d0c804392bb962553f23dc4c119d11b6bd1675)

conflict:
drivers/scsi/iscsi_tcp.c

Orabug: 27726302

Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: fix bio_will_gap() for first bvec with offset

Commit 729204ef49ec("block: relax check on sg gap") allows us to merge
bios, if both are physically contiguous. This change can merge a huge
number of small bios, through mkfs for example, mkfs.ntfs running time
can be decreased to ~1/10.

But if one rq starts with a non-aligned buffer (the 1st bvec's bv_offset
is non-zero) and if we allow the merge, it is quite difficult to respect
sg gap limit, especially the max segment size, or we risk having an
unaligned virtual boundary. This patch tries to avoid the issue by
disallowing a merge, if the req starts with an unaligned buffer.

Also add comments to explain why the merged segment can't end in
unaligned virt boundary.

Fixes: 729204ef49ec ("block: relax check on sg gap")
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Rewrote parts of the commit message and comments.

Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 5a8d75a1b8c99bdc926ba69b7b7dbe4fae81a5af)

Orabug: 27775588

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: relax check on sg gap

If the last bvec of the 1st bio and the 1st bvec of the next
bio are physically contigious, and the latter can be merged
to last segment of the 1st bio, we should think they don't
violate sg gap(or virt boundary) limit.

Both Vitaly and Dexuan reported lots of unmergeable small bios
are observed when running mkfs on Hyper-V virtual storage, and
performance becomes quite low. This patch fixes that performance
issue.

The same issue should exist on NVMe, since it sets virt boundary too.

Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reported-by: Dexuan Cui <decui@microsoft.com>
Tested-by: Dexuan Cui <decui@microsoft.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 729204ef49ec00b788ce23deb9eb922a5769f55d)

Orabug: 27775588

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: don't optimize for non-cloned bio in bio_get_last_bvec()

For !BIO_CLONED bio, we can use .bi_vcnt safely, but it
doesn't mean we can just simply return .bi_io_vec[.bi_vcnt - 1]
because the start postion may have been moved in the middle of
the bvec, such as splitting in the middle of bvec.

Fixes: 7bcd79ac50d9(block: bio: introduce helpers to get the 1st and last bvec)
Cc: stable@vger.kernel.org
Reported-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 90d0f0f11588ec692c12f9009089b398be395184)

Orabug: 27775588

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: merge: get the 1st and last bvec via helpers

This patch applies the two introduced helpers to
figure out the 1st and last bvec.

Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit e827091cb1bcd8e718ac3657845fb809c0b93324)

Orabug: 27775588

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: get the 1st and last bvec via helpers

This patch applies the two introduced helpers to
figure out the 1st and last bvec, and fixes the
original way after bio splitting.

Cc: stable@vger.kernel.org
Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 25e71a99f10e444cd00bb2ebccb11e1c9fb672b1)

Orabug: 27775588

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: check virt boundary in bio_will_gap()

In the following patch, the way for figuring out
the last bvec will be changed with a bit cost introduced,
so return immediately if the queue doesn't have virt
boundary limit. Actually most of devices have not
this limit.

Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit e0af29171aa8912e1ca95023b75ef336cd70d661)

Orabug: 27775588

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: bio: introduce helpers to get the 1st and last bvec

The bio passed to bio_will_gap() may be fast cloned from upper
layer(dm, md, bcache, fs, ...), or from bio splitting in block
core.

Unfortunately bio_will_gap() just figures out the last bvec via
'bi_io_vec[prev->bi_vcnt - 1]' directly, and this way is obviously
wrong.

This patch introduces two helpers for getting the first and last
bvec of one bio for fixing the issue.

Cc: stable@vger.kernel.org
Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 7bcd79ac50d9d83350a835bdb91c04ac9e098412)

Orabug: 27775588

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount

A test case is as the description says:
open(foobar, O_WRONLY);
sleep() --> reboot the server
close(foobar)

The bug is because in nfs4state.c in nfs4_reclaim_open_state() a few
line before going to restart, there is
clear_bit(NFS4CLNT_RECLAIM_NOGRACE, &state->flags).

NFS4CLNT_RECLAIM_NOGRACE is a flag for the client states not open
owner states. Value of NFS4CLNT_RECLAIM_NOGRACE is 4 which is the
value of NFS_O_WRONLY_STATE in nfs4_state->flags. So clearing it wipes
out state and when we go to close it, “call_close” doesn’t get set as
state flag is not set and CLOSE doesn’t go on the wire.

Signed-off-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Orabug: 27848303

(cherry picked from commit a41cbe86df3afbc82311a1640e20858c0cd7e065)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: add validity checks for bitmap block numbers

An privileged attacker can cause a crash by mounting a crafted ext4
image which triggers a out-of-bounds read in the function
ext4_valid_block_bitmap() in fs/ext4/balloc.c.

This issue has been assigned CVE-2018-1093.

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=199181
BugLink: https://bugzilla.redhat.com/show_bug.cgi?id=1560782
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
(cherry picked from commit 7dac4a1726a9c64a517d595c40e95e2d0d135f6f)

Orabug: 27854373
CVE: CVE-2018-1093

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
conflict:
    fs/ext4/balloc.c
    fs/ext4/ialloc.c
    Missing commit 6a797d2737 and
    EFSCORRUPTED missing. Use EIO instead.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir

Orabug: 27869411

While reflinking an inode, we create a new inode in orphan directory, then
take EX lock on it, reflink the original inode to orphan inode and release
EX lock. Once the lock is released another node might request it in PR mode
which causes downconvert of the lock to PR mode.

Later we attempt to initialize security acl for the orphan inode and move
it to the reflink destination. However, while doing this we dont take EX
lock on the inode. So effectively, we are doing this and accessing the
journal for this inode while holding PR lock. While accessing the journal,
we make

ci->ci_last_trans = journal->j_trans_id

At this point, if there is another downconvert request on this inode from
another node (PR->NL), we will trip on the following condition in
ocfs2_ci_checkpointed()

BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed);

because we hold the lock in PR mode and journal->j_trans_id is not greater
than ci_last_trans for the inode.

Fix this by taking orphan inode cluster lock in EX mode before
initializing security and moving orphan inode to reflink destination.
Use the __tracker variant while taking inode lock to avoid recursive
locking in the ocfs2_init_security_and_acl() call chain.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Input: gtco - fix potential out-of-bound access

parse_hid_report_descriptor() has a while (i < length) loop, which
only guarantees that there's at least 1 byte in the buffer, but the
loop body can read multiple bytes which causes out-of-bounds access.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
(cherry picked from commit a50829479f58416a013a4ccca791336af3c584c7)

Orabug: 27869844
CVE: CVE-2017-16643

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Input: ims-psu - check if CDC union descriptor is sane

Before trying to use CDC union descriptor, try to validate whether that it
is sane by checking that intf->altsetting->extra is big enough and that
descriptor bLength is not too big and not too small.

Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
(cherry picked from commit ea04efee7635c9120d015dcdeeeb6988130cb67a)

Orabug: 27870333
CVE: CVE-2017-16645

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vfio/pci: Virtualize Maximum Payload Size

With virtual PCI-Express chipsets, we now see userspace/guest drivers
trying to match the physical MPS setting to a virtual downstream port.
Of course a lone physical device surrounded by virtual interconnects
cannot make a correct decision for a proper MPS setting. Instead,
let's virtualize the MPS control register so that writes through to
hardware are disallowed. Userspace drivers like QEMU assume they can
write anything to the device and we'll filter out anything dangerous.
Since mismatched MPS can lead to AER and other faults, let's add it
to the kernel side rather than relying on userspace virtualization to
handle it.

OraBug: 27876914

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
(cherry picked from commit 523184972b282cd9ca17a76f6ca4742394856818)
Signed-off-by: Wim ten Have <wim.ten.have@oracle.com>
Reviewed-by: Ross Philipson <ross.philipson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vfio-pci: Virtualize PCIe & AF FLR

We use a BAR restore trick to try to detect when a user has performed
a device reset, possibly through FLR or other backdoors, to put things
back into a working state.  This is important for backdoor resets, but
we can actually just virtualize the "front door" resets provided via
PCIe and AF FLR.  Set these bits as virtualized + writable, allowing
the default write to set them in vconfig, then we can simply check the
bit, perform an FLR of our own, and clear the bit.  We don't actually
have the granularity in PCI to specify the type of reset we want to
do, but generally devices don't implement both PCIe and AF FLR and
we'll favor these over other types of reset, so we should generally
lineup.  We do test whether the device provides the requested FLR type
to stay consistent with hardware capabilities though.

This seems to fix several instance of devices getting into bad states
with userspace drivers, like dpdk, running inside a VM.

OraBug: 27876914

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Greg Rose <grose@lightfleet.com>
(cherry picked from commit ddf9dc0eb5314d6dac8b19b1cc37c739c6896e7e)
Signed-off-by: Wim ten Have <wim.ten.have@oracle.com>
Reviewed-by: Ross Philipson <ross.philipson@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

uek-rpm: Disable DMA CMA

Change the following kernel config:
CONFIG_CMA_SIZE_MBYTES=16
to
CONFIG_CMA_SIZE_MBYTES=0
Orabug: 27892359

Reviewed-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

nvme-pci: fix multiple ctrl removal scheduling

Commit c5f6ce97c1210 tries to address multiple resets but fails as
work_busy doesn't involve any synchronization and can fail.  This is
reproducible easily as can be seen by WARNING below which is triggered
with line:

WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

Allowing multiple resets can result in multiple controller removal as
well if different conditions inside nvme_reset_work fail and which
might deadlock on device_release_driver.

[  480.327007] WARNING: CPU: 3 PID: 150 at drivers/nvme/host/pci.c:1900 nvme_reset_work+0x36c/0xec0
[  480.327008] Modules linked in: rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast...
[  480.327044]  btusb videobuf2_core ghash_clmulni_intel snd_hwdep cfg80211 acer_wmi hci_uart..
[  480.327065] CPU: 3 PID: 150 Comm: kworker/u16:2 Not tainted 4.12.0-rc1+ #13
[  480.327065] Hardware name: Acer Predator G9-591/Mustang_SLS, BIOS V1.10 03/03/2016
[  480.327066] Workqueue: nvme nvme_reset_work
[  480.327067] task: ffff880498ad8000 task.stack: ffffc90002218000
[  480.327068] RIP: 0010:nvme_reset_work+0x36c/0xec0
[  480.327069] RSP: 0018:ffffc9000221bdb8 EFLAGS: 00010246
[  480.327070] RAX: 0000000000460000 RBX: ffff880498a98128 RCX: dead000000000200
[  480.327070] RDX: 0000000000000001 RSI: ffff8804b1028020 RDI: ffff880498a98128
[  480.327071] RBP: ffffc9000221be50 R08: 0000000000000000 R09: 0000000000000000
[  480.327071] R10: ffffc90001963ce8 R11: 000000000000020d R12: ffff880498a98000
[  480.327072] R13: ffff880498a53500 R14: ffff880498a98130 R15: ffff880498a98128
[  480.327072] FS:  0000000000000000(0000) GS:ffff8804c1cc0000(0000) knlGS:0000000000000000
[  480.327073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  480.327074] CR2: 00007ffcf3c37f78 CR3: 0000000001e09000 CR4: 00000000003406e0
[  480.327074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  480.327075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  480.327075] Call Trace:
[  480.327079]  ? __switch_to+0x227/0x400
[  480.327081]  process_one_work+0x18c/0x3a0
[  480.327082]  worker_thread+0x4e/0x3b0
[  480.327084]  kthread+0x109/0x140
[  480.327085]  ? process_one_work+0x3a0/0x3a0
[  480.327087]  ? kthread_park+0x60/0x60
[  480.327102]  ret_from_fork+0x2c/0x40
[  480.327103] Code: e8 5a dc ff ff 85 c0 41 89 c1 0f.....

This patch addresses the problem by using state of controller to
decide whether reset should be queued or not as state change is
synchronizated using controller spinlock.  Also cancel_work_sync is
used to make sure remove cancels the reset_work and waits for it to
finish.  This patch also changes return value from -ENODEV to more
appropriate -EBUSY if nvme_reset fails to change state.

Fixes: c5f6ce97c1210 ("nvme: don't schedule multiple resets")
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
(cherry picked from 82b057caefaff2a891f821a617d939f46e03e844)
Orabug: 27892359

Reviewed-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

nvme-pci: Fix nvme queue cleanup if IRQ setup fails

This patch fixes nvme queue cleanup if requesting an IRQ handler for
the queue's vector fails. It does this by resetting the cq_vector to
the uninitialized value of -1 so it is ignored for a controller reset.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
[changelog updates, removed misc whitespace changes]
Signed-off-by: Keith Busch <keith.busch@intel.com>
(cherry picked from f25a2dfc20e3a3ed8fe6618c331799dd7bd01190)
Orabug: 27892359

Reviewed-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

nvme/pci: Fix stuck nvme reset

The controller state is set to resetting prior to disabling the
controller, so this patch accounts for that state when deciding if it
needs to freeze the queues. Without this, an 'nvme reset /dev/nvme0'
blocks forever because the queues were never frozen.

Fixes: 82b057caefaf ("nvme-pci: fix multiple ctrl removal scheduling")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from ebef7368571d88f0f80b817e6898075c62265b4e)
Orabug: 27892359

Reviewed-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

nvme: don't schedule multiple resets

The queue_work only fails if the work is pending, but not yet running. If
the work is running, the work item would get requeued, triggering a
double reset. If the first reset fails for any reason, the second
reset triggers:

WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

Hitting that schedules controller deletion for a second time, which
potentially takes a reference on the device that is being deleted.
If the reset occurs at the same time as a hot removal event, this causes
a double-free.

This patch has the reset helper function check if the work is busy
prior to queueing, and changes all places that schedule resets to use
this function. Since most users don't want to sync with that work, the
"flush_work" is moved to the only caller that wants to sync.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg<sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from c5f6ce97c12104668784ee17fb927c52a944d3d8)
Orabug: 27892359

Reviewed-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

blk-mq: fix use-after-free in blk_mq_free_tag_set()

tags is freed in blk_mq_free_rq_map() and should not be used after that.
The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
free_cpumask_var() is nop.

tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
free cpumask in its counter part, blk_mq_free_tags().

Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from f42d79ab67322e51b92dd7aa965e310c71352a64 )
Orabug: 27892359

Reviewed-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

USB: core: prevent malicious bNumInterfaces overflow

commit 48a4ff1c7bb5a32d2e396b03132d20d552c0eca7 upstream.

A malicious USB device with crafted descriptors can cause the kernel
to access unallocated memory by setting the bNumInterfaces value too
high in a configuration descriptor. Although the value is adjusted
during parsing, this adjustment is skipped in one of the error return
paths.

This patch prevents the problem by setting bNumInterfaces to 0
initially. The existing code already sets it to the proper value
after parsing is complete.

Orabug: 27895909

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4c5ae6a301a5415d1334f6c655bebf91d475bd89)
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Reviewed-by: junxiao.bi@oracle.com
Reviewed-by: jack.schwartz@oracle.com
Signed-off-by: Brian Maly <brian.maly@oracle.com>

driver core: platform: fix race condition with driver_override

The driver_override implementation is susceptible to race condition when
different threads are reading vs storing a different driver override.
Add locking to avoid race condition.

Fixes: 3d713e0e382e ("driver core: platform: add device binding path 'driver_override'")
Cc: stable@vger.kernel.org
Signed-off-by: Adrian Salido <salidoa@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6265539776a0810b7ce6398c27866ddb9c6bd154)

Orabug: 27897874
CVE: CVE-2017-12146

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

usb/core: usb_alloc_dev(): fix setting of ->portnum

With commit 69bec7259853 ("USB: core: let USB device know device node"),
the port1 argument of usb_alloc_dev() gets overwritten as follows:

  ... usb_alloc_dev(..., unsigned port1)
  {
    ...
    if (!parent->parent) {
      port1 = usb_hcd_find_raw_port_number(..., port1);
    }
    ...
  }

Later on, this now overwritten port1 gets assigned to ->portnum:

  dev->portnum = port1;

However, since xhci_find_raw_port_number() isn't idempotent, the
aforementioned commit causes a number of KASAN splats like the following:

  BUG: KASAN: slab-out-of-bounds in xhci_find_raw_port_number+0x98/0x170
                                       at addr ffff8801d9311670
  Read of size 8 by task kworker/2:1/87
  [...]
  Workqueue: usb_hub_wq hub_event
   0000000000000188 000000005814b877 ffff8800cba17588 ffffffff8191447e
   0000000041b58ab3 ffffffff82a03209 ffffffff819143a2 ffffffff82a252f4
   ffff8801d93115e0 0000000000000188 ffff8801d9311628 ffff8800cba17588
  Call Trace:
   [<ffffffff8191447e>] dump_stack+0xdc/0x15e
   [<ffffffff819143a2>] ? _atomic_dec_and_lock+0xa2/0xa2
   [<ffffffff814e2cd1>] ? print_section+0x61/0xb0
   [<ffffffff814e4939>] print_trailer+0x179/0x2c0
   [<ffffffff814f0d84>] object_err+0x34/0x40
   [<ffffffff814f4388>] kasan_report_error+0x2f8/0x8b0
   [<ffffffff814eb91e>] ? __slab_alloc+0x5e/0x90
   [<ffffffff812178c0>] ? __lock_is_held+0x90/0x130
   [<ffffffff814f5091>] kasan_report+0x71/0xa0
   [<ffffffff814ec082>] ? kmem_cache_alloc_trace+0x212/0x560
   [<ffffffff81d99468>] ? xhci_find_raw_port_number+0x98/0x170
   [<ffffffff814f33d4>] __asan_load8+0x64/0x70
   [<ffffffff81d99468>] xhci_find_raw_port_number+0x98/0x170
   [<ffffffff81db0105>] xhci_setup_addressable_virt_dev+0x235/0xa10
   [<ffffffff81d9ea51>] xhci_setup_device+0x3c1/0x1430
   [<ffffffff8121cddd>] ? trace_hardirqs_on+0xd/0x10
   [<ffffffff81d9fac0>] ? xhci_setup_device+0x1430/0x1430
   [<ffffffff81d9fad3>] xhci_address_device+0x13/0x20
   [<ffffffff81d2081a>] hub_port_init+0x55a/0x1550
   [<ffffffff81d28705>] hub_event+0xef5/0x24d0
   [<ffffffff81d27810>] ? hub_port_debounce+0x2f0/0x2f0
   [<ffffffff8195e1ee>] ? debug_object_deactivate+0x1be/0x270
   [<ffffffff81210203>] ? print_rt_rq+0x53/0x2d0
   [<ffffffff8121657d>] ? trace_hardirqs_off+0xd/0x10
   [<ffffffff8226acfb>] ? _raw_spin_unlock_irqrestore+0x5b/0x60
   [<ffffffff81250000>] ? irq_domain_set_hwirq_and_chip+0x30/0xb0
   [<ffffffff81256339>] ? debug_lockdep_rcu_enabled+0x39/0x40
   [<ffffffff812178c0>] ? __lock_is_held+0x90/0x130
   [<ffffffff81196877>] process_one_work+0x567/0xec0
  [...]

Afterwards, xhci reports some functional errors:

  xhci_hcd 0000:00:14.0: ERROR: unexpected setup address command completion
                                code 0x11.
  xhci_hcd 0000:00:14.0: ERROR: unexpected setup address command completion
                                code 0x11.
  usb 4-3: device not accepting address 2, error -22

Fix this by not overwriting the port1 argument in usb_alloc_dev(), but
storing the raw port number as required by OF in an additional variable,
raw_port.

Orabug: 27908746

Fixes: 69bec7259853 ("USB: core: let USB device know device node")
Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7222c832254a75dcd67d683df75753d4a4e125bb)
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ctf: drop the run-as-root error

Best practices continue to suggest running dwarf2ctf with nonroot
privileges, but established practice depends on being able to run
as root.

Orabug: 27852654
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>

rds: Node crashes when trace buffer is opened

The problem is that trace_printk() cannot handle a format string like
%p* which prints out the content from a pointer.  It stores the
pointer value.  When the trace file is opened, that pointer is
de-referenced and causes problems.  To use ftrace with such format
string, __trace_printk() needs to be used.  It stores the formatted
string in the trace buffer instead.

Orabug: 27846191

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Tom Hromatka <tom.hromatka@oracle.com>

xfs: fix accidental reversion of aa6a6227435cb

In commit id 7ea004358b97c ("xfs: don't leave EFIs on AIL on mount
failure") I accidentally reverted aa6a6227435cb06c ("xfs: toggle
readonly state around xfs_log_mount_finish"), which caused a regression
in generic/417. Put back the code that enables iunlink processing on a
ro mount.

Orabug: 27845869

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

net: cdc_ether: fix divide by 0 on bad descriptors

Orabug: 27841392
CVE: CVE-2017-16649

Setting dev->hard_mtu to 0 will cause a divide error in
usbnet_probe. Protect against devices with bogus CDC Ethernet
functional descriptors by ignoring a zero wMaxSegmentSize.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Acked-by: Oliver Neukum <oneukum@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/usb/cdc_ether.c
whitespace correction

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

sysctl: Drop reference added by grab_header in proc_sys_readdir

Fixes CVE-2016-9191, proc_sys_readdir doesn't drop reference
added by grab_header when return from !dir_emit_dots path.
It can cause any path called unregister_sysctl_table will
wait forever.

The calltrace of CVE-2016-9191:

[ 5535.960522] Call Trace:
[ 5535.963265]  [<ffffffff817cdaaf>] schedule+0x3f/0xa0
[ 5535.968817]  [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0
[ 5535.975346]  [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130
[ 5535.982256]  [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130
[ 5535.988972]  [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80
[ 5535.994804]  [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0
[ 5536.001227]  [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0
[ 5536.007648]  [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0
[ 5536.014654]  [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0
[ 5536.021657]  [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40
[ 5536.029344]  [<ffffffff810d7704>] partition_sched_domains+0x44/0x450
[ 5536.036447]  [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5536.043844]  [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0
[ 5536.051336]  [<ffffffff8116789d>] update_flag+0x11d/0x210
[ 5536.057373]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.064186]  [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60
[ 5536.070899]  [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10
[ 5536.077420]  [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450
[ 5536.084234]  [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220
[ 5536.091049]  [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60
[ 5536.097571]  [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220
[ 5536.104207]  [<ffffffff810bc83f>] process_one_work+0x1df/0x710
[ 5536.110736]  [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710
[ 5536.117461]  [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0
[ 5536.123697]  [<ffffffff810bcd70>] ? process_one_work+0x710/0x710
[ 5536.130426]  [<ffffffff810c3f7e>] kthread+0xfe/0x120
[ 5536.135991]  [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40
[ 5536.142041]  [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230

One cgroup maintainer mentioned that "cgroup is trying to offline
a cpuset css, which takes place under cgroup_mutex.  The offlining
ends up trying to drain active usages of a sysctl table which apprently
is not happening."
The real reason is that proc_sys_readdir doesn't drop reference added
by grab_header when return from !dir_emit_dots path. So this cpuset
offline path will wait here forever.

See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13

Fixes: f0c3b5093add ("[readdir] convert procfs")
Cc: stable@vger.kernel.org
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: Yang Shukui <yangshukui@huawei.com>
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 93362fa47fe98b62e4a34ab408c4a418432e7939)

Orabug: 27841944
CVE: CVE-2016-9191
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>

Revert "sysctl: Drop reference added by grab_header in proc_sys_readdir"

This reverts commit ae3ad84cc98a31bc65525115e61bd9f0c2334cc1. The commit
has a bogus CVE number which needs to be corrected. Simply changing the
metadata here.

Signed-off-by: Jack Vogel <jack.vogel@oracle.com>

xfs: remove "no-allocation" reservations for file creations

[ Upstream commit f59cf5c29919d17b61913c3360a7bd29b72975c1 ]

If we create a new file we will need an inode, and usually some metadata
in the parent direction.  Aiming for everything to go well despite the
lack of a reservation leads to dirty transactions cancelled under a heavy
create/delete load.  This patch removes those nospace transactions, which
will lead to slightly earlier ENOSPC on some workloads, but instead
prevent file system shutdowns due to cancelling dirty transactions for
others.

A customer could observe assertations failures and shutdowns due to
cancelation of dirty transactions during heavy NFS workloads as shown
below:

2017-05-30 21:17:06 kernel: WARNING: [ 2670.728125] XFS: Assertion failed: error != -ENOSPC, file: fs/xfs/xfs_inode.c, line: 1262

2017-05-30 21:17:06 kernel: WARNING: [ 2670.728222] Call Trace:
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728246]  [<ffffffff81795daf>] dump_stack+0x63/0x81
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728262]  [<ffffffff810a1a5a>] warn_slowpath_common+0x8a/0xc0
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728264]  [<ffffffff810a1b8a>] warn_slowpath_null+0x1a/0x20
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728285]  [<ffffffffa01bf403>] asswarn+0x33/0x40 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728308]  [<ffffffffa01bb07e>] xfs_create+0x7be/0x7d0 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728329]  [<ffffffffa01b6ffb>] xfs_generic_create+0x1fb/0x2e0 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728348]  [<ffffffffa01b7114>] xfs_vn_mknod+0x14/0x20 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728366]  [<ffffffffa01b7153>] xfs_vn_create+0x13/0x20 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728380]  [<ffffffff81231de5>] vfs_create+0xd5/0x140
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728390]  [<ffffffffa045ddb9>] do_nfsd_create+0x499/0x610 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728396]  [<ffffffffa0465fa5>] nfsd3_proc_create+0x135/0x210 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728401]  [<ffffffffa04561e3>] nfsd_dispatch+0xc3/0x210 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728416]  [<ffffffffa03bfa43>] svc_process_common+0x453/0x6f0 [sunrpc]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728423]  [<ffffffffa03bfdf3>] svc_process+0x113/0x1f0 [sunrpc]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728427]  [<ffffffffa0455bcf>] nfsd+0x10f/0x180 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728432]  [<ffffffffa0455ac0>] ? nfsd_destroy+0x80/0x80 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728438]  [<ffffffff810c0d58>] kthread+0xd8/0xf0
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728441]  [<ffffffff810c0c80>] ? kthread_create_on_node+0x1b0/0x1b0
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728451]  [<ffffffff8179d962>] ret_from_fork+0x42/0x70
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728453]  [<ffffffff810c0c80>] ? kthread_create_on_node+0x1b0/0x1b0
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728454] ---[ end trace f9822c842fec81d4 ]---

2017-05-30 21:17:06 kernel: ALERT: [ 2670.728477] XFS (sdb): Internal error xfs_trans_cancel at line 983 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x4ee/0x7d0 [xfs]

2017-05-30 21:17:06 kernel: ALERT: [ 2670.728684] XFS (sdb): Corruption of in-memory data detected. Shutting down filesystem
2017-05-30 21:17:06 kernel: ALERT: [ 2670.728685] XFS (sdb): Please umount the filesystem and rectify the problem(s)

Orabug: 27609439

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

xfs: don't print warnings when xfs_log_force fails

[ Upstream commit 84a4620cfe97c9d57e39b2369bfb77faff55063d ]

There are only two reasons for xfs_log_force / xfs_log_force_lsn to fail:
one is an I/O error, for which xlog_bdstrat already logs a warning, and
the second is an already shutdown log due to a previous I/O errors. In
the latter case we'll already have a previous indication for the actual
error, but the large stream of misleading warnings from xfs_log_force
will probably scroll it out of the message buffer.

Simply removing the warnings thus makes the XFS log reporting significantly
better.

Orabug: 27609404
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: Properly retry failed dquot items in case of error during buffer writeback

[ Upstream commit 373b0589dc8d58bc09c9a28d03611ae4fb216057 ]

Once the inode item writeback errors is already fixed, it's time to fix the same
problem in dquot code.

Although there were no reports of users hitting this bug in dquot code (at least
none I've seen), the bug is there and I was already planning to fix it when the
correct approach to fix the inodes part was decided.

This patch aims to fix the same problem in dquot code, regarding failed buffers
being unable to be resubmitted once they are flush locked.

Tested with the recently test-case sent to fstests list by Hou Tao.

Orabug: 27609404
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: Properly retry failed inode items in case of error during buffer writeback

[ Upstream commit d3a304b6292168b83b45d624784f973fdc1ca674 ]

When a buffer has been failed during writeback, the inode items into it
are kept flush locked, and are never resubmitted due the flush lock, so,
if any buffer fails to be written, the items in AIL are never written to
disk and never unlocked.

This causes unmount operation to hang due these items flush locked in AIL,
but this also causes the items in AIL to never be written back, even when
the IO device comes back to normal.

I've been testing this patch with a DM-thin device, creating a
filesystem larger than the real device.

When writing enough data to fill the DM-thin device, XFS receives ENOSPC
errors from the device, and keep spinning on xfsaild (when 'retry
forever' configuration is set).

At this point, the filesystem can not be unmounted because of the flush locked
items in AIL, but worse, the items in AIL are never retried at all
(once xfs_inode_item_push() will skip the items that are flush locked),
even if the underlying DM-thin device is expanded to the proper size.

This patch fixes both cases, retrying any item that has been failed
previously, using the infra-structure provided by the previous patch.

Orabug: 27609404
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: Add infrastructure needed for error propagation during buffer IO failure

[ Upstream commit 0b80ae6ed13169bd3a244e71169f2cc020b0c57a ]

With the current code, XFS never re-submit a failed buffer for IO,
because the failed item in the buffer is kept in the flush locked state
forever.

To be able to resubmit an log item for IO, we need a way to mark an item
as failed, if, for any reason the buffer which the item belonged to
failed during writeback.

Add a new log item callback to be used after an IO completion failure
and make the needed clean ups.

Orabug: 27609404
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: remove xfs_trans_ail_delete_bulk

[ Upstream commit 27af1bbf524459962d1477a38ac6e0b7f79aaecc ]

xfs_iflush_done uses an on-stack variable length array to pass the log
items to be deleted to xfs_trans_ail_delete_bulk. On-stack VLAs are a
nasty gcc extension that can lead to unbounded stack allocations, but
fortunately we can easily avoid them by simply open coding
xfs_trans_ail_delete_bulk in xfs_iflush_done, which is the only caller
of it except for the single-item xfs_trans_ail_delete.

Orabug: 27609404
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: fix and streamline error handling in xfs_end_io

[ Upstream commit 787eb485509f9d58962bd8b4dbc6a5ac6e2034fe ]

There are two different cases of buffered I/O errors:

- first we can have an already shutdown fs.  In that case we should skip
   any on-disk operations and just clean up the appen transaction if
   present and destroy the ioend
- a real I/O error.  In that case we should cleanup any lingering COW
   blocks.  This gets skipped in the current code and is fixed by this
   patch.

Orabug: 27609404
Originally-Signed-off-by: Christoph Hellwig <hch@lst.de>
[darrick: heavily modified since we don't support cow in uek4...]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: don't leave EFIs on AIL on mount failure

[ Upstream commit f0b2efad16e78623b5a156f6e4e9166907b83155 ]

Log recovery occurs in two phases at mount time. In the first phase,
EFIs and EFDs are processed and potentially cancelled out. EFIs without
EFD objects are inserted into the AIL for processing and recovery in the
second phase. xfs_mountfs() runs various other operations between the
phases and is thus subject to failure. If failure occurs after the first
phase but before the second, pending EFIs sit on the AIL, pin it and
cause the mount to hang.

Update the mount sequence to ensure that pending EFIs are cancelled in
the event of failure. Add a recovery cancellation mechanism to iterate
the AIL and cancel all EFI items when requested. Plumb cancellation
support through the log mount finish helper and update xfs_mountfs() to
invoke cancellation in the event of failure after recovery has started.

Orabug: 27609404
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: use EFI refcount consistently in log recovery

[ Upstream commit e32a1d1fbf6eb2bdc24aa0502e827ff4d2234604 ]

The EFI is initialized with a reference count of 2. One for the EFI to
ensure the item makes it to the AIL and one for the subsequently created
EFD to release the EFI once the EFD is committed. Log recovery uses the
EFI in a similar manner, but implements a hack to remove both references
in one call once the EFD is handled.

Update log recovery to use EFI reference counting in a manner consistent
with the log. When an EFI is encountered during recovery, an EFI item is
allocated and inserted to the AIL directly. Since the EFI reference is
typically dropped when the EFI is unpinned and this is analogous with
AIL insertion, drop the EFI reference at this point.

When a corresponding EFD is encountered in the log, this indicates that
the extents were freed, no processing is required and the EFI can be
dropped. Update xlog_recover_efd_pass2() to simply drop the EFD
reference at this point rather than open code the AIL removal and EFI
free.

Remaining EFIs (i.e., with no corresponding EFD) are processed in
xlog_recover_finish(). An EFD transaction is allocated and the extents
are freed, which transfers ownership of the EFI reference to the EFD
item in the log.

Orabug: 27609404
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: ensure EFD trans aborts on log recovery extent free failure

[ Upstream commit 6bc43af3d5f507254b8de2058ea51f6ec998ae52 ]

Log recovery attempts to free extents with leftover EFIs in the AIL
after initial processing. If the extent free fails (e.g., due to
unrelated fs corruption), the transaction is cancelled, though it
might not be dirtied at the time. If this is the case, the EFD does
not abort and thus does not release the EFI. This can lead to hangs
as the EFI pins the AIL.

Update xlog_recover_process_efi() to log the EFD in the transaction
before xfs_free_extent() errors are handled to ensure the
transaction is dirty, aborts the EFD and releases the EFI on error.
Since this is a requirement for EFD processing (and consistent with
xfs_bmap_finish()), update the EFD logging helper to do the extent
free and unconditionally log the EFD. This encodes the required EFD
logging behavior into the helper and reduces the likelihood of
errors down the road.

[dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
failure.]

Orabug: 27609404
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: fix efi/efd error handling to avoid fs shutdown hangs

[ Upstream commit 8d99fe92fed019e203f458370129fb28b3fb5740 ]

Freeing an extent in XFS involves logging an EFI (extent free
intention), freeing the actual extent, and logging an EFD (extent
free done). The EFI object is created with a reference count of 2:
one for the current transaction and one for the subsequently created
EFD. Under normal circumstances, the first reference is dropped when
the EFI is unpinned and the second reference is dropped when the EFD
is committed to the on-disk log.

In event of errors or filesystem shutdown, there are various
potential cleanup scenarios depending on the state of the EFI/EFD.
The cleanup scenarios are confusing and racy, as demonstrated by the
following test sequence:

# mount $dev $mnt
# fsstress -d $mnt -n 99999 -p 16 -z -f fallocate=1 \
-f punch=1 -f creat=1 -f unlink=1 &
# sleep 5
# killall -9 fsstress; wait
# godown -f $mnt
# umount

... in which the final umount can hang due to the AIL being pinned
indefinitely by one or more EFI items. This can occur due to several
conditions. For example, if the shutdown occurs after the EFI is
committed to the on-disk log and the EFD committed to the CIL, but
before the EFD committed to the log, the EFD iop_committed() abort
handler does not drop its reference to the EFI. Alternatively,
manual error injection in the xfs_bmap_finish() codepath shows that
if an error occurs after the EFI transaction is committed but before
the EFD is constructed and logged, the EFI is never released from
the AIL.

Update the EFI/EFD item handling code to use a more straightforward
and reliable approach to error handling. If an error occurs after
the EFI transaction is committed and before the EFD is constructed,
release the EFI explicitly from xfs_bmap_finish(). If the EFI
transaction is cancelled, release the EFI in the unlock handler.

Once the EFD is constructed, it is responsible for releasing the EFI
under any circumstances (including whether the EFI item aborts due
to log I/O error). Update the EFD item handlers to release the EFI
if the transaction is cancelled or aborts due to log I/O error.
Finally, update xfs_bmap_finish() to log at least one EFD extent to
the transaction before xfs_free_extent() errors are handled to
ensure the transaction is dirty and EFD item error handling is
triggered.

Orabug: 27609404
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: return committed status from xfs_trans_roll()

[ Upstream commit d43ac29be7a174f93a3d26cc1e68668fe86b782f ]

Some callers need to make error handling decisions based on whether
the current transaction successfully committed or not. Rename
xfs_trans_roll(), add a new parameter and provide a wrapper to
preserve existing callers.

Orabug: 27609404
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

xfs: disentagle EFI release from the extent count

[ Upstream commit 5e4b5386a2c29429add601c8cfb45bb10d80c490 ]

Release of the EFI either occurs based on the reference count or the
extent count. The extent count used is either the count tracked in
the EFI or EFD, depending on the particular situation. In either
case, the count is initialized to the final value and thus always
matches the current efi_next_extent value once the EFI is completely
constructed. For example, the EFI extent count is increased as the
extents are logged in xfs_bmap_finish() and the full free list is
always completely processed. Therefore, the count is guaranteed to
be complete once the EFI transaction is committed. The EFD uses the
efd_nextents counter to release the EFI. This counter is initialized
to the count of the EFI when the EFD is created. Thus the EFD, as
currently used, has no concept of partial EFI release based on
extent count.

Given that the EFI extent count is always released in whole, use of
the extent count for reference counting is unnecessary. Remove this
level of the API and release the EFI based on the core reference
count. The efi_next_extent counter remains because it is still used
to track the slot to log the next extent to free.

Orabug: 27609404
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: wen.gang.wang@oracle.com
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

netfilter: ebtables: CONFIG_COMPAT: don't trust userland offsets

We need to make sure the offsets are not out of range of the
total size.
Also check that they are in ascending order.

The WARN_ON triggered by syzkaller (it sets panic_on_warn) is
changed to also bail out, no point in continuing parsing.

Briefly tested with simple ruleset of
-A INPUT --limit 1/s' --log
plus jump to custom chains using 32bit ebtables binary.

Reported-by: <syzbot+845a53d13171abf8bf29@syzkaller.appspotmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit b71812168571fa55e44cdd0254471331b9c4c4c6)

Orabug: 27774012
CVE: CVE-2018-1068

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>

ACPI / PAD: don't register acpi_pad driver if running as Xen dom0

When running as Xen dom0 a special processor_aggregator driver is
needed. Don't register the standard driver in this case.

Without that check an error message:

"Error: Driver 'processor_aggregator' is already registered,
aborting..."

will be displayed.

Signed-off-by: Juergen Gross <jgross@suse.com>
[ rjw: Minor fixups ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit e311404f7925f6879817ebf471651c0bb5935604)
Orabug: 27796473
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Tested-by: Sriharsha NS <sriharsha.ns@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

sched/fair: Fix typo in sync_throttle()

We should update cfs_rq->throttled_clock_task, not
pcfs_rq->throttle_clock_task.

The effects of this bug was probably occasionally erratic
group scheduling, particularly in cgroups-intense workloads.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
[ Added changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 55e16d30bd99 ("sched/fair: Rework throttle_count sync")
Link: http://lkml.kernel.org/r/1468050862-18864-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b8922125e4790fa237a8a4204562ecf457ef54bb)

Orabug: 27787518

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Mridula Shastry <mridula.c.shastry@oracle.com>

sched/fair: Do not announce throttled next buddy in dequeue_task_fair()

Hierarchy could be already throttled at this point. Throttled next
buddy could trigger a NULL pointer dereference in pick_next_task_fair().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 754bd598be9bbc953bc709a9e8ed7f3188bfb9d7)

Orabug: 27787518

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Mridula Shastry <mridula.c.shastry@oracle.com>

sched/fair: Initialize and rework throttle_count for new task-groups

This patch is a combination of the following three patches from mainline:

094f469172e0 sched/fair: Initialize throttle_count for new task-groups lazily

Cgroup created inside throttled group must inherit current throttle_count.
Broken throttle_count allows to nominate throttled entries as a next buddy,
later this leads to null pointer dereference in pick_next_task_fair().

This patch initialize cfs_rq->throttle_count at first enqueue: laziness
allows to skip locking all rq at group creation. Lazy approach also allows
to skip full sub-tree scan at throttling hierarchy (not in this patch).

8663e24d56dc sched/fair: Reorder cgroup creation code

A future patch needs rq->lock held _after_ we link the task_group into
the hierarchy. In order to avoid taking every rq->lock twice, reorder
things a little and create online_fair_sched_group() to be called
after we link the task_group.

All this code is still ran from css_alloc() so css_online() isn't in
fact used for this.

55e16d30bd99 sched/fair: Rework throttle_count sync

Since we already take rq->lock when creating a cgroup, use it to also
sync the throttle_count and avoid the extra state and enqueue path
branch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: bsegall@google.com
Cc: linux-kernel@vger.kernel.org
[ Fixed build warning. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The patches have been combined because applying them separately will
cause a KABI breakage and introduce a dummy function.

Orabug: 27787518

Conflicts:
kernel/sched/fair.c
kernel/sched/core.c
kernel/sched/sched.h

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Mridula Shastry <mridula.c.shastry@oracle.com>

perf tools: Move syscall number fallbacks from perf-sys.h to tools/arch/x86/include/asm/

And remove the empty tools/arch/x86/include/asm/unistd_{32,64}.h files
introduced by eae7a755ee81 ("perf tools, x86: Build perf on older
user-space as well").

This way we get closer to mirroring the kernel for cases where __NR_
can't be found for some include path/_GNU_SOURCE/whatever scenario.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Link: http://lkml.kernel.org/n/tip-kpj6m3mbjw82kg6krk2z529e@git.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
(cherry picked from commit cec07f53c398)

Orabug: 27240053
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

crypto: FIPS - allow tests to be disabled in FIPS mode

In FIPS mode, additional restrictions may apply. If these restrictions
are violated, the kernel will panic(). This patch allows test vectors
for symmetric ciphers to be marked as to be skipped in FIPS mode.

Together with the patch, the XTS test vectors where the AES key is
identical to the tweak key is disabled in FIPS mode. This test vector
violates the FIPS requirement that both keys must be different.

Reported-by: Tapas Sarangi <TSarangi@trustwave.com>
Signed-off-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 10faa8c0d6c3b22466f97713a9533824a2ea1c57)

Orabug: 27809271

Signed-off-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Conflicts:
crypto/testmgr.h

crypto: xts - consolidate sanity check for keys

The patch centralizes the XTS key check logic into the service function
xts_check_key which is invoked from the different XTS implementations.
With this, the XTS implementations in ARM, ARM64, PPC and S390 have now
a sanity check for the XTS keys similar to the other arches.

In addition, this service function received a check to ensure that the
key != the tweak key which is mandated by FIPS 140-2 IG A.9. As the
check is not present in the standards defining XTS, it is only enforced
in FIPS mode of the kernel.

Signed-off-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 28856a9e52c7cac712af6c143de04766617535dc)

Orabug: 27809271

Signed-off-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

crypto: rng - Zero seed in crypto_rng_reset

If we allocate a seed on behalf ot the user in crypto_rng_reset,
we must ensure that it is zeroed afterwards or the RNG may be
compromised.

Reported-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit b617b702da4e922277806f81c411d3051107d462)

Orabug: 27809271

Signed-off-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

enic: set IG desc cache flag in open

New adapter needs CMD_OPENF_IG_DESCCACHE flag to be set. If this flag is
not set, fw flushes the global IG desc cache. This flag is nop in older
adapter.

Also increment driver version

Signed-off-by: Govindarajulu Varadarajan <gvaradar@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5de0c022f1b0bce073cb04dd69ed7982805e5763)
Orabug: 27587345
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

Drivers: hv: utils: fix crash when device is removed from host side

The crash is observed when a service is being disabled host side while
userspace daemon is connected to the device:

[   90.244859] general protection fault: 0000 [#1] SMP
...
[   90.800082] Call Trace:
[   90.800082]  [<ffffffff81187008>] __fput+0xc8/0x1f0
[   90.800082]  [<ffffffff8118716e>] ____fput+0xe/0x10
...
[   90.800082]  [<ffffffff81015278>] do_signal+0x28/0x580
[   90.800082]  [<ffffffff81086656>] ? finish_task_switch+0xa6/0x180
[   90.800082]  [<ffffffff81443ebf>] ? __schedule+0x28f/0x870
[   90.800082]  [<ffffffffa01ebbaa>] ? hvt_op_read+0x12a/0x140 [hv_utils]
...

The problem is that hvutil_transport_destroy() which does misc_deregister()
freeing the appropriate device is reachable by two paths: module unload
and from util_remove(). While module unload path is protected by .owner in
struct file_operations util_remove() path is not. Freeing the device while
someone holds an open fd for it is a show stopper.

In general, it is not possible to revoke an fd from all users so the only
way to solve the issue is to defer freeing the hvutil_transport structure.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426102
(cherry picked from commit 9420098adc50a88d4a441e0f92d54bfa7af44448)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>

Drivers: hv: utils: introduce HVUTIL_TRANSPORT_DESTROY mode

When Hyper-V host asks us to remove some util driver by closing the
appropriate channel there is no easy way to force the current file
descriptor holder to hang up but we can start to respond -EBADF to all
operations asking it to exit gracefully.

As we're setting hvt->mode from two separate contexts now we need to use
a proper locking.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426102
(cherry picked from commit a15025660d4703a8b37290a14734cb4a84875770)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>

Drivers: hv: utils: rename outmsg_lock

As a preparation to reusing outmsg_lock to protect test-and-set openrations
on 'mode' rename it the more general 'lock'.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426102
(cherry picked from commit a72f3a4ccff22de879a1f599210ecdd9bd483a43)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>

Drivers: hv: utils: fix memory leak on on_msg() failure

inmsg should be freed in case of on_msg() failure to avoid memory leak.
Preserve the error code from on_msg().

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426102
(cherry picked from commit 1f75338b6fece2bbd42ac3623830c65e2df6e031)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>

Drivers: hv: utils: use memdup_user in hvt_op_write

Use memdup_user to handle OOM.

Fixes: 14b50f80c32d ('Drivers: hv: util: introduce hv_utils_transport abstraction')
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426102
(cherry picked from commit b00359642c2427da89dc8f77daa2c9e8a84e6d76)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>

hv: util: checking the wrong variable

We don't catch this allocation failure because there is a typo and we
check the wrong variable.

Fixes: 14b50f80c32d ('Drivers: hv: util: introduce hv_utils_transport abstraction')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 27426102
(cherry picked from commit 9dd6a06430c94299651d74b9ed5ca8396ab8ff1f)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Tim Tianyang Chen <tianyang.chen@oracle.com>

net/rds: Avoid copy overhead if send buff is full

Avoid copying the message from user-space if we already
know there's not enough space in the send buffer.

Change-Id: I5ddde7d41bbaeaf25f398c721d02babd3893b73f
Orabug: 27747165
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

ext4: fix ->put_link panic

Orabug: 27498770

The following panic was caught. Something wrong with the storage and io error
was returned, generic_readlink()->ext4_follow_link()->page_follow_link_light()
returned with NULL page and error link, then ext4_put_link() tried to free the
error link and panic.

[25144440.198756] device-mapper: snapshots: Invalidating snapshot: Unable to allocate exception.
[25144440.332969] Aborting journal on device dm-7-8.
[25144440.338462] Buffer I/O error on dev dm-7, logical block 3702784, lost sync page write
[25144440.342625] Buffer I/O error on dev dm-7, logical block 0, lost sync page write
[25144440.342627] EXT4-fs error (device dm-7): ext4_journal_check_start:56: Detected aborted journal
[25144440.342629] EXT4-fs (dm-7): Remounting filesystem read-only
[25144440.342630] EXT4-fs (dm-7): previous I/O error to superblock detected
[25144440.342634] Buffer I/O error on dev dm-7, logical block 0, lost sync page write
[25144440.390336] JBD2: Error -5 detected when updating journal superblock for dm-7-8.
[25144464.799979] Buffer I/O error on dev dm-7, logical block 1573499, lost async page write
[25144464.809517] Buffer I/O error on dev dm-7, logical block 1573501, lost async page write
[25144464.819048] Buffer I/O error on dev dm-7, logical block 1573659, lost async page write
[25144464.828669] Buffer I/O error on dev dm-7, logical block 1573660, lost async page write
[25144464.838207] Buffer I/O error on dev dm-7, logical block 1573662, lost async page write
[25144464.847798] Buffer I/O error on dev dm-7, logical block 1573675, lost async page write
[25144464.857326] Buffer I/O error on dev dm-7, logical block 1573677, lost async page write
[25144464.866848] Buffer I/O error on dev dm-7, logical block 1573696, lost async page write
[25144464.876383] Buffer I/O error on dev dm-7, logical block 1573698, lost async page write
[25144464.885903] Buffer I/O error on dev dm-7, logical block 1573703, lost async page write
[25144496.335355] ------------[ cut here ]------------
[25144496.341039] kernel BUG at mm/slub.c:3413!
[25144496.345997] invalid opcode: 0000 [#1] SMP
[25144496.351074] Modules linked in: dm_snapshot dm_bufio nfnetlink_queue nfnetlink_log nfnetlink iptable_filter ip_tables oracleacfs(PO) oracleadvm(PO) oracleoks(PO) mpt3sas scsi_transport_sas raid_class nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs fscache lockd sunrpc grace ipmi_poweroff ipmi_devintf bonding rds_rdma rds ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad mlx4_ib mlx4_core dm_multipath bnx2i cnic uio cxgb4i libcxgbi cxgb4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 fuse iTCO_wdt iTCO_vendor_support sb_edac edac_core i2c_i801 lpc_ich mfd_core sg ixgbe dca ptp pps_core vxlan udp_tunnel ip6_udp_tunnel mdio ipmi_ssif i2c_core ipmi_si ipmi_msghandler wmi acpi_pad ext4 jbd2 mbcache sd_mod ahci libahci megaraid_sas dm_mirror
[25144496.433281]  dm_region_hash dm_log dm_mod
[25144496.436514] CPU: 8 PID: 201734 Comm: tar Tainted: P           O    4.1.12-61.28.1.el6uek.x86_64 #2
[25144496.447204] Hardware name: Oracle Corporation ORACLE SERVER X6-2/ASM,MOTHERBOARD,1U, BIOS 38070000 12/16/2016
[25144496.458974] task: ffff888ce26d0e00 ti: ffff888da1c50000 task.ti: ffff888da1c50000
[25144496.467999] RIP: 0010:[<ffffffff811e5de9>]  [<ffffffff811e5de9>] kfree+0x159/0x170
[25144496.477238] RSP: 0018:ffff888da1c53da8  EFLAGS: 00010246
[25144496.483734] RAX: 001fffff80000400 RBX: fffffffffffffffb RCX: ffffffffa00ee860
[25144496.492480] RDX: 0000000000000000 RSI: fffffffffffffffb RDI: ffffea0001ffffc0
[25144496.501116] RBP: ffff888da1c53dc8 R08: 000000000001abc0 R09: ffff885efec07480
[25144496.509753] R10: ffffffff8118b1b7 R11: 0000000000000000 R12: 0000000000000000
[25144496.518425] R13: ffffffffa00ee890 R14: 00000000fffffffb R15: 000000000000005f
[25144496.527061] FS:  00007f949c3957a0(0000) GS:ffff885eff400000(0000) knlGS:0000000000000000
[25144496.536769] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[25144496.543674] CR2: 00007f47cdbd8d20 CR3: 0000006093b7b000 CR4: 00000000003406e0
[25144496.552333] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[25144496.560961] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[25144496.569630] Stack:
[25144496.572452]  ffff888da1c53de8 ffff8801e33e7bc0 0000000000000000 ffff888da1c53de8
[25144496.581455]  ffff888da1c53dd8 ffffffffa00ee890 ffff888da1c53eb8 ffffffff8120f7ec
[25144496.590422]  ffff888da1c53eb8 ffffffff812144c3 ffff88bec2658320 ffff8801e33e7bc0
[25144496.599391] Call Trace:
[25144496.602619]  [<ffffffffa00ee890>] ext4_put_link+0x30/0x40 [ext4]
[25144496.609819]  [<ffffffff8120f7ec>] generic_readlink+0x8c/0xb0
[25144496.616627]  [<ffffffff812144c3>] ? user_path_at_empty+0x63/0xa0
[25144496.623816]  [<ffffffff8120a496>] SyS_readlinkat+0x116/0x130
[25144496.630629]  [<ffffffff8120a0eb>] SyS_readlink+0x1b/0x20
[25144496.637065]  [<ffffffff8169d76e>] system_call_fastpath+0x12/0x71
[25144496.644278] Code: ff ff eb 8e 66 f7 07 00 c0 74 20 48 8b 07 31 f6 f6 c4 40 74 03 8b 77 68 e8 f5 d7 fa ff e9 70 ff ff ff 48 8b 7f 30 e9 19 ff ff ff <0f> 0b 0f 1f 44 00 00 eb f9 66 66 66 66 66 2e 0f 1f 84 00 00 00
[25144496.667149] RIP  [<ffffffff811e5de9>] kfree+0x159/0x170
[25144496.673496]  RSP <ffff888da1c53da8>

Mainline/uek5 not have this issue, as the ->following_link and ->put_link have
been refactored there. The patche set to do that is a little big, so I don't
bother to backport them, just write this small patch to fix the issue.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>

KVM/VMX: Clear spec_ctrl status when resetting vcpu

vmx->spec_ctrl was not set to 0 in vmx_vcpu_reset, which could result in
IBRS getting stuck on all the time, even with 'spectre_v2=off' set. This
was most notable when rebooting from an older kernel into a newer
retpoline-enabled kernel resulted in up to 80% CPU performance drop.

OraBug: 27774415

Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Patrick Colp <patrick.colp@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>