www.infradead.org Git - users/jedix/linux-maple.git/log

net: add netif_is_lag_master helper

Orabug: 29495360

Some code does not mind if the master is bond or team and treats them
the same, as generic LAG.
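
As a rough userspace illustration of the idea (not the kernel code itself), a generic LAG check can be modeled as "bond master or team master". The struct, flag names, and flag values below are hypothetical stand-ins chosen for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for device flags; the values are illustrative only. */
enum {
	DEMO_F_MASTER  = 1 << 0,
	DEMO_F_BONDING = 1 << 1,
	DEMO_F_TEAM    = 1 << 2,
};

struct demo_net_device {
	unsigned int flags;
};

static inline bool demo_is_bond_master(const struct demo_net_device *dev)
{
	return (dev->flags & DEMO_F_BONDING) && (dev->flags & DEMO_F_MASTER);
}

static inline bool demo_is_team_master(const struct demo_net_device *dev)
{
	return (dev->flags & DEMO_F_TEAM) && (dev->flags & DEMO_F_MASTER);
}

/* Generic LAG master: callers need not care which aggregation driver it is. */
static inline bool demo_is_lag_master(const struct demo_net_device *dev)
{
	assert(dev != NULL);
	return demo_is_bond_master(dev) || demo_is_team_master(dev);
}
```

A caller that only needs "is this a LAG master?" would call demo_is_lag_master() and avoid open-coding both checks.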

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7be61833042e7757745345eedc7b0efee240c189)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: add netif_is_team_port helper

Orabug: 29495360

Similar to other helpers, caller can use this to find out if device is
team port.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f7f019ee6d117de5007d0b10e7960696bbf111eb)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: add netif_is_team_master helper

Orabug: 29495360

Similar to other helpers, caller can use this to find out if device is
team master.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c981e4213e9d2d4ec79501bd607722ec712742a2)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/team/team.c
include/linux/netdevice.h

scsi: scsi_transport_iscsi: redirect conn error to console

This commit changes "detected conn error" printk
log level from KERN_INFO to KERN_WARNING. This change
is made with the assumption that KERN_WARNING messages
are configured to be redirected to console. It is
particularly useful to have detected connection errors
redirected to console when using iscsi boot device as it
may give clues as to why the system appears to be hung.
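
The assumption above can be made concrete with a small userspace model. A message reaches the console only when its level is numerically lower than console_loglevel; KERN_WARNING is level 4 and KERN_INFO is level 6. The function and enum names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric severities behind KERN_WARNING and KERN_INFO. */
enum { DEMO_LOGLEVEL_WARNING = 4, DEMO_LOGLEVEL_INFO = 6 };

/*
 * A message is printed on the console only if its level is numerically
 * lower (more severe) than the current console_loglevel.
 */
static inline bool demo_reaches_console(int msg_level, int console_loglevel)
{
	assert(msg_level >= 0 && msg_level <= 7);
	return msg_level < console_loglevel;
}
```

With console_loglevel set to 5, the warning is shown and the info message is not. With console_loglevel at 4 or lower, the warning is suppressed as well, which is why the change depends on the console configuration.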

Orabug: 29469714

Signed-off-by: Fred Herard <fred.herard@oracle.com>
Reviewed-by: Allen Pais <allen.pais@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

Revert x86/apic/x2apic: set affinity of a single interrupt to one cpu

The commit 092aa78c11f0
("x86/apic/x2apic: set affinity of a single interrupt to one cpu")
was causing performance regression on block storage server on X5.

On the OCI X5 server, irqs were not bound to CPUs 1:1; each irq's
affinity was set to multiple cpus
(/proc/$irq/smp_affinity: 00,003ffff0,0003ffff, cpu0-17 and 36-53).
This is not the default behavior of bnxt_en. The bnxt_en driver sets
irq affinity when the NIC link comes up, and OCI assumed that most of
the bnxt_en interrupts would go to cpu3.

After the patch "x86/apic/x2apic:
set affinity of a single interrupt to one cpu",
if we set irq to cpu 1:1, it works fine, but if we set irq affinity
to multiple cpus, it only sets irq_cfg->domain/cpumask to the first
online cpu which is on the cpu affinity list. With the current setting
which caused the perf issue, although /proc/$irq/smp_affinity is set
to multiple cpus, irq_cfg->domain cpumask only has cpu 0. This led all
ens4f0-TxRx interrupts to be routed to cpu0. The iscsi target application
was also running on CPU0 during the testing, which led to the performance
issue. The issue is no longer seen after the patch was reverted.
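
The effect described above can be modeled in userspace: a multi-CPU affinity mask collapses to the first online CPU in the mask. This is a sketch with hypothetical helper names, limited to 64 CPUs for simplicity, and is not the x2apic code.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask covering CPUs lo..hi inclusive (at most 64 CPUs modeled). */
static inline uint64_t demo_cpu_range(int lo, int hi)
{
	uint64_t mask = 0;

	assert(lo >= 0 && lo <= hi && hi < 64);
	for (int cpu = lo; cpu <= hi; cpu++)
		mask |= UINT64_C(1) << cpu;
	return mask;
}

/* Lowest-numbered CPU present in both masks, or -1 if there is none. */
static inline int demo_first_online_cpu(uint64_t affinity, uint64_t online)
{
	uint64_t usable = affinity & online;

	for (int cpu = 0; cpu < 64; cpu++)
		if (usable & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}
```

For the mask quoted above (cpu0-17 and 36-53) with every CPU online, the single target selected is cpu 0, matching the observed routing of all interrupts to cpu0.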

Orabug: 29449976
Signed-off-by: Mridula Shastry <mridula.c.shastry@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

jiffies: use jiffies64_to_nsecs() to fix 100% steal usage for xen vcpu hotplug

[ Not relevant upstream, therefore no upstream commit. ]

To fix, use jiffies64_to_nsecs() directly instead of deriving the result
from jiffies_to_usecs().

The return type of jiffies_to_usecs() is 'unsigned int', so when the
result does not fit in an 'unsigned int', the upper 32 bits are
discarded.

Suppose USEC_PER_SEC=1000000L and HZ=1000, below are the expected and
actual incorrect result of jiffies_to_usecs(0x7770ef70):

- expected  : jiffies_to_usecs(0x7770ef70) = 0x000001d291274d80
- incorrect : jiffies_to_usecs(0x7770ef70) = 0x0000000091274d80

The leading 0x000001d200000000 is discarded.
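
The truncation can be reproduced in userspace with fixed-width types. This is a minimal sketch assuming HZ=1000 as in the example above; the function names are hypothetical and only model the 32-bit versus 64-bit return type.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_HZ           1000u
#define DEMO_USEC_PER_SEC 1000000u

/* Full-width conversion: the result is kept in 64 bits. */
static inline uint64_t demo_jiffies_to_usecs64(uint64_t j)
{
	assert(j <= UINT64_MAX / (DEMO_USEC_PER_SEC / DEMO_HZ));
	return j * (DEMO_USEC_PER_SEC / DEMO_HZ);
}

/* Models an 'unsigned int' return type: only the low 32 bits survive. */
static inline uint32_t demo_jiffies_to_usecs32(uint64_t j)
{
	return (uint32_t)demo_jiffies_to_usecs64(j);
}
```

Calling both helpers with 0x7770ef70 yields 0x000001d291274d80 and 0x91274d80 respectively, the two values listed above.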

After xen vcpu hotplug and when the new vcpu steal clock is calculated for
the first time, the result of this_rq()->prev_steal_time in
steal_account_process_tick() would be far smaller than the expected
value, because jiffies_to_usecs() discards the leading 32 bits.

As a result, the diff between current steal and this_rq()->prev_steal_time
is always very large. Steal usage would become 100% when the initial steal
clock obtained from xen hypervisor is very large during xen vcpu hotplug,
that is, when the guest is already up for a long time.

The bug can be detected by doing the following:

* Boot xen guest with vcpus=2 and maxvcpus=4
* Leave the guest running for a month so that the initial steal clock for
  the new vcpu would be very large
* Hotplug 2 extra vcpus
* The steal time of new vcpus in /proc/stat would increase abnormally and
  sometimes steal usage in top can become 100%

This was incidentally fixed upstream by the patch set starting with
commit 93825f2ec736 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY")
and ending with
commit b672592f0221 ("sched/cputime: Remove generic asm headers").

Orabug: 28806208

Link: https://lkml.org/lkml/2019/2/28/1373
Suggested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

time: Introduce jiffies64_to_nsecs()

This will be needed for the cputime_t to nsec conversion.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 07e5f5e353aaa61696c8353d87050994a0c4648a)

Orabug: 28806208

This backport makes jiffies64_to_nsecs() available for the next patch.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net_failover: delay taking over primary device to accommodate udevd renaming

There is an inherent race with udev rename in userspace due to the exposure
of two lower slave devices while kernel attempts to manage the creation
for failover bonding itself automatically. The existing userspace naming
logic in udevd was not specifically written for this in-kernel
automagic.

The clean fix for the problem is either to update udevd to not try to
rename the 3-netdev (ideally rename the device in a coordinated manner),
or to fix the kernel to hide the 2 lower devices, which do not have to
be shown to userspace unless needed (1-netdev model).

However, our pursuance of 1-netdev model had not been acknowledged by
upstream, and there's no motivation in the systemd/udevd community at
this point to refactor the rename logic and make it work well with
3-netdev.

Hyper-V's netvsc mitigated this by postponing the VF's dev_open() to
allow a userspace thread to rename the device within a 100ms window.
For the interim, we follow netvsc's approach to avoid the renaming
failure, until a clean solution is available upstream.

OraBug: 29281273

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
Acked-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Tested-by: Vijay Balakrishna <vijay.balakrishna@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

fs/dcache.c: add cond_resched() in shrink_dentry_list()

Orabug: 29412146

As previously reported (https://patchwork.kernel.org/patch/8642031/)
it's possible to call shrink_dentry_list with a large number of dentries
(> 10000).  This, in turn, could trigger the softlockup detector and
possibly trigger a panic.  In addition to the unmount path being
vulnerable to this scenario, at SuSE we've observed similar situation
happening during process exit on processes that touch a lot of dentries.
Here is an excerpt from a crash dump.  The numbers after the colon are
the number of dentries on the list passed to shrink_dentry_list:

PID 99760: 10722
PID 107530: 215
PID 108809: 24134
PID 108877: 21331
PID 141708: 16487

So we want to kill between 15k-25k dentries without yielding.

And one possible call stack looks like:

4 [ffff8839ece41db0] _raw_spin_lock at ffffffff8152a5f8
5 [ffff8839ece41db0] evict at ffffffff811c3026
6 [ffff8839ece41dd0] __dentry_kill at ffffffff811bf258
7 [ffff8839ece41df0] shrink_dentry_list at ffffffff811bf593
8 [ffff8839ece41e18] shrink_dcache_parent at ffffffff811bf830
9 [ffff8839ece41e50] proc_flush_task at ffffffff8120dd61
10 [ffff8839ece41ec0] release_task at ffffffff81059ebd
11 [ffff8839ece41f08] do_exit at ffffffff8105b8ce
12 [ffff8839ece41f78] sys_exit at ffffffff8105bd53
13 [ffff8839ece41f80] system_call_fastpath at ffffffff81532909

While some of the callers of shrink_dentry_list do use cond_resched,
this is not sufficient to prevent softlockups.  So just move
cond_resched into shrink_dentry_list from its callers.
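
The shape of the change can be sketched in userspace: the list walker itself offers to yield on every entry, so no caller can hand it a long list that runs without a scheduling point. The list type and demo_cond_resched() are hypothetical stand-ins; the counter only exists to make the behavior observable.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dentry {
	struct demo_dentry *next;
};

/* Hypothetical stand-in for cond_resched(); counts how often it is offered. */
static unsigned long demo_yield_calls;

static void demo_cond_resched(void)
{
	demo_yield_calls++;
}

/*
 * Walk the whole list, offering to yield before handling each entry.
 * Returns the number of entries processed.
 */
static size_t demo_shrink_list(struct demo_dentry *head)
{
	size_t processed = 0;

	while (head) {
		demo_cond_resched();
		head = head->next;
		processed++;
	}
	return processed;
}
```

With a list of 24134 entries, as in the crash dump excerpt, the walker passes 24134 yield points instead of depending on what its caller does.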

David said: I've found hundreds of occurrences of warnings that we emit
when need_resched stays set for a prolonged period of time with the
stack trace that is included in the change log.

Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32785c0539b7e96f77a14a4f4ab225712665a5a4)
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Conflicts:
fs/dcache.c
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

NFS: commit direct writes even if they fail partially

If some of the WRITE calls making up an O_DIRECT write syscall fail,
we neglect to commit, even if some of the WRITEs succeed.

We also depend on the commit code to free the reference count on the
nfs_page taken in the "if (request_commit)" case at the end of
nfs_direct_write_completion(). The problem was originally noticed
because ENOSPC's encountered partway through a write would result in a
closed file being sillyrenamed when it should have been unlinked.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
(cherry picked from commit 1b8d97b0a837beaf48a8449955b52c650a7114b4)

Orabug: 28212440
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Reviewed-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: update correct congestion map for loopback transport

Loopback transport delivers data directly to the destination
socket since destination is always local to the machine.
The connection data structure passed to an internal
initializer API rds_inc_init() is from send side (unlike the
usual call to APIs from receive path when receive side
connection data structure is passed).

This inconsistency causes an update of the incorrect
congestion map when marking destination port congested when
loopback transport is used (which is when one or both end(s)
of a RDS connection has an IP loopback address).

The fix is to ensure the correct map is updated, that of the
destination IP regardless of delivery coming from send side
of loopback transport or receive side of other transports.

Orabug: 29175685

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: only look at the bg_flags field if it is valid

The bg_flags field in the block group descriptors is only valid if the
uninit_bg or metadata_csum feature is enabled. We were not
consistently looking at this field; fix this.

Also block group #0 must never have uninitialized allocation bitmaps,
or need to be zeroed, since that's where the root inode, and other
special inodes are set up. Check for these conditions and mark the
file system as corrupted if they are detected.
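
A simplified userspace model of the two rules above, covering only the block bitmap flag. The struct and helper names are hypothetical; has_desc_csum stands for "uninit_bg or metadata_csum is enabled", and the flag values mirror ext4's on-disk EXT4_BG_* bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BG_INODE_UNINIT 0x0001 /* inode table/bitmap not initialized */
#define DEMO_BG_BLOCK_UNINIT 0x0002 /* block bitmap not initialized */

struct demo_group_desc {
	uint16_t bg_flags;
};

/* bg_flags is only consulted when a descriptor checksum feature is enabled. */
static inline bool demo_block_bitmap_uninit(bool has_desc_csum,
					    const struct demo_group_desc *gd)
{
	return has_desc_csum && (gd->bg_flags & DEMO_BG_BLOCK_UNINIT);
}

/* Group 0 holds the root inode, so an uninitialized bitmap there is corruption. */
static inline bool demo_group_is_corrupt(unsigned int group, bool has_desc_csum,
					 const struct demo_group_desc *gd)
{
	return group == 0 && demo_block_bitmap_uninit(has_desc_csum, gd);
}
```

Without the feature, stale bits in bg_flags are ignored; with it, the same bits on group 0 mark the file system as corrupted.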

This addresses CVE-2018-10876.

https://bugzilla.kernel.org/show_bug.cgi?id=199403

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
(cherry picked from commit 8844618d8aa7a9973e7b527d038a2a589665002c)

Orabug: 29316684
CVE: CVE-2018-10876

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/balloc.c
fs/ext4/ialloc.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

uek-rpm: Add kernel-uek version to kernel-ueknano provides

Orabug: 29357643

The kernel-ueknano package provides kernel-uek without a version, so it
can match any installed kernel-uek version. This commit also adds the
version that the rpm is built for to the kernel-uek provides.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: Set sk_prot_creator when cloning sockets to the right proto

sk->sk_prot and sk->sk_prot_creator can differ when the app uses
IPV6_ADDRFORM (transforming an IPv6-socket to an IPv4-one).
Which is why sk_prot_creator is there to make sure that sk_prot_free()
does the kmem_cache_free() on the right kmem_cache slab.

Now, if such a socket gets transformed back to a listening socket (using
connect() with AF_UNSPEC) we will allocate an IPv4 tcp_sock through
sk_clone_lock() when a new connection comes in. But sk_prot_creator will
still point to the IPv6 kmem_cache (as everything got copied in
sk_clone_lock()). When freeing, we will thus put this
memory back into the IPv6 kmem_cache although it was allocated in the
IPv4 cache. I have seen memory corruption happening because of this.

With slub-debugging and MEMCG_KMEM enabled this gives the warning
"cache_from_obj: Wrong slab cache. TCPv6 but object is from TCP"
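
The mismatch can be modeled in userspace with two pointers per socket. This is only a sketch of the bookkeeping, with hypothetical types; it does not allocate from real slab caches.

```c
#include <assert.h>

struct demo_proto {
	const char *cache_name;
};

struct demo_sock {
	const struct demo_proto *prot;         /* current protocol ops */
	const struct demo_proto *prot_creator; /* cache the object is freed to */
};

/* Old behavior: the clone inherits the parent's (possibly stale) creator. */
static inline struct demo_sock demo_clone_buggy(const struct demo_sock *sk)
{
	return *sk;
}

/* Fixed behavior: the clone is allocated via sk->prot, so record that. */
static inline struct demo_sock demo_clone_fixed(const struct demo_sock *sk)
{
	struct demo_sock newsk = *sk;

	assert(sk->prot != 0);
	newsk.prot_creator = sk->prot;
	return newsk;
}
```

For a listener that was created as IPv6 and converted with IPV6_ADDRFORM (prot is TCP, prot_creator is TCPv6), the buggy clone would be freed into the TCPv6 cache, while the fixed clone is freed into the TCP cache it came from.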

A C-program to trigger this:

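#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)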
{
        int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
        int new_fd, newest_fd, client_fd;
        struct sockaddr_in6 bind_addr;
        struct sockaddr_in bind_addr4, client_addr1, client_addr2;
        struct sockaddr unsp;
        int val;

        memset(&bind_addr, 0, sizeof(bind_addr));
        bind_addr.sin6_family = AF_INET6;
        bind_addr.sin6_port = htons(42424);

        memset(&client_addr1, 0, sizeof(client_addr1));
        client_addr1.sin_family = AF_INET;
        client_addr1.sin_port = htons(42424);
        client_addr1.sin_addr.s_addr = inet_addr("127.0.0.1");

        memset(&client_addr2, 0, sizeof(client_addr2));
        client_addr2.sin_family = AF_INET;
        client_addr2.sin_port = htons(42421);
        client_addr2.sin_addr.s_addr = inet_addr("127.0.0.1");

        memset(&unsp, 0, sizeof(unsp));
        unsp.sa_family = AF_UNSPEC;

        bind(fd, (struct sockaddr *)&bind_addr, sizeof(bind_addr));

        listen(fd, 5);

        client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        connect(client_fd, (struct sockaddr *)&client_addr1, sizeof(client_addr1));
        new_fd = accept(fd, NULL, NULL);
        close(fd);

        val = AF_INET;
        setsockopt(new_fd, SOL_IPV6, IPV6_ADDRFORM, &val, sizeof(val));

        connect(new_fd, &unsp, sizeof(unsp));

        memset(&bind_addr4, 0, sizeof(bind_addr4));
        bind_addr4.sin_family = AF_INET;
        bind_addr4.sin_port = htons(42421);
        bind(new_fd, (struct sockaddr *)&bind_addr4, sizeof(bind_addr4));

        listen(new_fd, 5);

        client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        connect(client_fd, (struct sockaddr *)&client_addr2, sizeof(client_addr2));

        newest_fd = accept(new_fd, NULL, NULL);
        close(new_fd);

        close(client_fd);
        close(newest_fd);
}

As far as I can see, this bug has been there since the beginning of the
git-days.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9d538fa60bad4f7b23193c89e843797a1cf71ef3)

Orabug: 29422739
CVE: CVE-2018-9568

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: always check block group bounds in ext4_init_block_bitmap()

commit 819b23f1c501b17b9694325471789e6b5cc2d0d2 upstream.

Regardless of whether the flex_bg feature is set, we should always
check to make sure the bits we are setting in the block bitmap are
within the block group bounds.
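
A bounds test of this kind can be sketched as follows. The helper name and parameters are hypothetical, and the real function derives the group's first block and size from the superblock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if 'block' lies inside the group that starts at 'group_first_block'
 * and spans 'blocks_per_group' blocks. Written to avoid overflow in the
 * upper-bound comparison.
 */
static inline bool demo_block_in_group(uint64_t block,
				       uint64_t group_first_block,
				       uint32_t blocks_per_group)
{
	assert(blocks_per_group > 0);
	return block >= group_first_block &&
	       block - group_first_block < blocks_per_group;
}
```

Only blocks for which this returns true would have their bit set in the group's bitmap, whether or not flex_bg is enabled.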

https://bugzilla.kernel.org/show_bug.cgi?id=199865

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ac48bb9bc0a32f5a4432be1645b57607f8c46aa7)

Orabug: 29428607
CVE: CVE-2018-10878

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: make sure bitmaps and the inode table don't overlap with bg descriptors

commit 77260807d1170a8cf35dbb06e07461a655f67eee upstream.

It's really bad when the allocation bitmaps and the inode table
overlap with the block group descriptors, since it causes random
corruption of the bg descriptors. So we really want to head those off
at the pass.

https://bugzilla.kernel.org/show_bug.cgi?id=199865

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ac93c718365ac6ea9d7631641c8dec867d623491)

Orabug: 29428607
CVE: CVE-2018-10878

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vfs: Add sb_rdonly(sb) to query the MS_RDONLY flag on s_flags

Add an sb_rdonly() function to query the MS_RDONLY flag on sb->s_flags
preparatory to providing an SB_RDONLY flag.
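
In spirit, the helper is a one-line flag test. A userspace sketch with a hypothetical struct (MS_RDONLY is bit 0 of the mount flags):

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MS_RDONLY 1UL /* same bit value as MS_RDONLY */

struct demo_super_block {
	unsigned long s_flags;
};

/* Callers ask "is it read-only?" instead of open-coding the flag test. */
static inline bool demo_sb_rdonly(const struct demo_super_block *sb)
{
	assert(sb != 0);
	return (sb->s_flags & DEMO_MS_RDONLY) != 0;
}
```

Hiding the test behind a helper lets the underlying flag be renamed later without touching every caller.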

Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 94e92e7ac90d06e1e839e112d3ae80b2457dbdd7)

Orabug: 29428607
CVE: CVE-2018-10878

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

iscsi: Capture iscsi debug messages using tracepoints

This commit enhances iscsi initiator modules to capture iscsi debug messages
using linux kernel tracepoint facility:

https://www.kernel.org/doc/Documentation/trace/tracepoints.txt

The following tracepoint events have been created under the iscsi tracepoint
event group:

iscsi_dbg_conn - to capture connection debug messages (libiscsi module)
iscsi_dbg_session - to capture session debug messages (libiscsi module)
iscsi_dbg_eh - to capture error handling debug messages (libiscsi module)
iscsi_dbg_tcp - to capture iscsi tcp debug messages (libiscsi_tcp module)
iscsi_dbg_sw_tcp - to capture iscsi sw tcp debug messages (iscsi_tcp module)
iscsi_dbg_trans_session - to capture iscsi transport session debug messages
(scsi_transport_iscsi module)
iscsi_dbg_trans_conn - to capture iscsi transport conn debug messages
(scsi_transport_iscsi module)

Orabug: 29429855

Signed-off-by: Fred Herard <fred.herard@oracle.com>
Reviewed-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KEYS: add missing permission check for request_key() destination

When the request_key() syscall is not passed a destination keyring, it
links the requested key (if constructed) into the "default" request-key
keyring.  This should require Write permission to the keyring.  However,
there is actually no permission check.

This can be abused to add keys to any keyring to which only Search
permission is granted.  This is because Search permission allows joining
the keyring.  keyctl_set_reqkey_keyring(KEY_REQKEY_DEFL_SESSION_KEYRING)
then will set the default request-key keyring to the session keyring.
Then, request_key() can be used to add keys to the keyring.

Both negatively and positively instantiated keys can be added using this
method.  Adding negative keys is trivial.  Adding a positive key is a
bit trickier.  It requires that either /sbin/request-key positively
instantiates the key, or that another thread adds the key to the process
keyring at just the right time, such that request_key() misses it
initially but then finds it in construct_alloc_key().

Fix this bug by checking for Write permission to the keyring in
construct_get_dest_keyring() when the default keyring is being used.

We don't do the permission check for non-default keyrings because that
was already done by the earlier call to lookup_user_key().  Also,
request_key_and_link() is currently passed a 'struct key *' rather than
a key_ref_t, so the "possessed" bit is unavailable.

We also don't do the permission check for the "requestor keyring", to
continue to support the use case described by commit 8bbf4976b59f
("KEYS: Alter use of key instantiation link-to-keyring argument") where
/sbin/request-key recursively calls request_key() to add keys to the
original requestor's destination keyring.  (I don't know of any users
who actually do that, though...)

Fixes: 3e30148c3d52 ("[PATCH] Keys: Make request-key create an authorisation key")
Cc: <stable@vger.kernel.org> # v2.6.13+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 4dca6ea1d9432052afb06baf2e3ae78188a4410b)

Orabug: 29304551
CVE: CVE-2017-17807

Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
security/keys/request_key.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

KEYS: Don't permit request_key() to construct a new keyring

If request_key() is used to find a keyring, only do the search part - don't
do the construction part if the keyring was not found by the search. We
don't really want keyrings in the negative instantiated state since the
rejected/negative instantiation error value in the payload is unioned with
keyring metadata.

Now the kernel gives an error:

request_key("keyring", "#selinux,bdekeyring", "keyring", KEY_SPEC_USER_SESSION_KEYRING) = -1 EPERM (Operation not permitted)

Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 911b79cde95c7da0ec02f48105358a36636b7a71)

Orabug: 29304551
CVE: CVE-2017-17807

Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mlx4_ib: Distribute completion vectors when zero is supplied

MAD packet sending/receiving is not properly virtualized in
CX-3. Hence, these are proxied through the PF driver. The proxying
uses UD QPs. The associated CQs are created with completion vector
zero, in anticipation that zero will return the least-used vector, as
per commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs").

However, this does not happen, and we see that only the first EQ is
used for these proxy QPs.

This leads to great imbalance in CPU processing, in particular during
fail-over and fail-back, when a large number of RDMA CM
requests/responses are proxied through the PF driver.

Orabug: 29318191

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Fix TX timeout during netpoll.

Orabug: 29357977

The current netpoll implementation in the bnxt_en driver has problems
that may miss TX completion events. bnxt_poll_work() in effect is
only handling at most 1 TX packet before exiting. In addition,
there may be in flight TX completions that ->poll() may miss even
after we fix bnxt_poll_work() to handle all visible TX completions.
netpoll may not call ->poll() again and HW may not generate IRQ
because the driver does not ARM the IRQ when the budget (0 for netpoll)
is reached.

We fix it by handling all TX completions and by always ARMing the IRQ
when we exit ->poll() with 0 budget.

Also, the logic to ACK the completion ring in case it is almost filled
with TX completions needs to be adjusted to take care of the 0 budget
case, as discussed with Eric Dumazet <edumazet@google.com>

Reported-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Song Liu <songliubraving@fb.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 73f21c653f930f438d53eed29b5e4c65c8a0f906)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Fix for system hang if request_irq fails

Orabug: 29357977

Fix bug in the error code path when bnxt_request_irq() returns failure.
bnxt_disable_napi() should not be called in this error path because
NAPI has not been enabled yet.

Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c58387ab1614f6d7fb9e244f214b61e7631421fc)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Fix firmware message delay loop regression.

Orabug: 29357977

A recent change to reduce delay granularity waiting for firmware
response has caused a regression.  With a tighter delay loop,
the driver may see the beginning part of the response faster.
The original 5 usec delay to wait for the rest of the message
is not long enough and some messages are detected as invalid.

Increase the maximum wait time from 5 usec to 20 usec.  Also, fix
the debug message that shows the total delay time for the response
when the message times out.  With the new logic, the delay time
is not fixed per iteration of the loop, so we define a macro to
show the total delay time.

Fixes: 9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cc559c1ac250a6025bd4a9528e424b8da250655b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c

bnxt_en: reduce timeout on initial HWRM calls

Orabug: 29357977

Testing with DIM enabled on older kernels indicated that firmware calls
were slower than expected. More detailed analysis indicated that the
default 25us delay was higher than necessary. Reducing the time spent in
usleep_range() for the first several calls would reduce the overall
latency of firmware calls on newer Intel processors.

Signed-off-by: Andy Gospodarek <gospo@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9751e8e714872aa650b030e52a9fafbb694a3714)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c

bnxt_en: Fix NULL pointer dereference at bnxt_free_irq().

Orabug: 29357977

When open fails during ethtool -L ring change, for example, the driver
may crash at bnxt_free_irq() because bp->bnapi is NULL.

If we fail to allocate all the new rings, bnxt_open_nic() will free
all the memory including bp->bnapi. Subsequent call to bnxt_close_nic()
will try to dereference bp->bnapi in bnxt_free_irq().

Fix it by checking for !bp->bnapi in bnxt_free_irq().

Fixes: e5811b8c09df ("bnxt_en: Add IRQ remapping logic.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cb98526bf9b985866d648dbb9c983ba9eb59daba)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Check valid VNIC ID in bnxt_hwrm_vnic_set_tpa().

Orabug: 29357977

During initialization, if we encounter errors, there is a code path that
calls bnxt_hwrm_vnic_set_tpa() with invalid VNIC ID. This may cause a
warning in firmware logs.

Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 3c4fe80b32c685bdc02b280814d0cfe80d441c72)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Do not modify max IRQ count after RDMA driver requests/frees IRQs.

Orabug: 29357977

Calling bnxt_set_max_func_irqs() to modify the max IRQ count requested or
freed by the RDMA driver is flawed.  The max IRQ count is checked when
re-initializing the IRQ vectors and this can happen multiple times
during ifup or ethtool -L.  If the max IRQ is reduced and the RDMA
driver is operational, we may not initialize IRQs correctly.  This
problem shows up on VFs with very small number of MSIX.

There is no other logic that relies on the IRQ count excluding the ones
used by RDMA.  So we fix it by just removing the call to subtract or
add the IRQs used by RDMA.

Fixes: a588e4580a7e ("bnxt_en: Add interface to support RDMA driver.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 30f529473ec962102e8bcd33a6a04f1e1b490ae2)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.h
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c

mm: cleancache: fix corruption on missed inode invalidation

commit 6ff38bd40230af35e446239396e5fc8ebd6a5248 upstream.

If all pages are deleted from the mapping by memory reclaim and also
moved to the cleancache:

__delete_from_page_cache
  (no shadow case)
  unaccount_page_cache_page
    cleancache_put_page
  page_cache_delete
    mapping->nrpages -= nr
    (nrpages becomes 0)

We don't clean the cleancache for an inode after final file truncation
(removal).

truncate_inode_pages_final
  check (nrpages || nrexceptional) is false
    no truncate_inode_pages
      no cleancache_invalidate_inode(mapping)

This way when reading the new file created with same inode we may get
these trash leftover pages from cleancache and see wrong data instead of
the contents of the new file.

Fix it by always doing truncate_inode_pages which is already ready for
nrpages == 0 && nrexceptional == 0 case and just invalidates inode.
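
The before/after control flow can be modeled in userspace. The struct and function names are hypothetical, and cleancache_has_pages stands in for stale pages left in cleancache for the inode.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_mapping {
	unsigned long nrpages;
	unsigned long nrexceptional;
	bool cleancache_has_pages; /* stale copies held by cleancache */
};

/* Models truncate_inode_pages(): drops pages and invalidates cleancache. */
static inline void demo_truncate_inode_pages(struct demo_mapping *m)
{
	assert(m != 0);
	m->nrpages = 0;
	m->nrexceptional = 0;
	m->cleancache_has_pages = false;
}

/* Old final truncation: skipped entirely when the mapping looks empty. */
static inline void demo_truncate_final_old(struct demo_mapping *m)
{
	if (m->nrpages || m->nrexceptional)
		demo_truncate_inode_pages(m);
}

/* New final truncation: always runs, so cleancache is always invalidated. */
static inline void demo_truncate_final_new(struct demo_mapping *m)
{
	demo_truncate_inode_pages(m);
}
```

Starting from a mapping with no pages but with stale cleancache entries, the old path leaves them in place while the new path clears them.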

[akpm@linux-foundation.org: add comment, per Jan]
Link: http://lkml.kernel.org/r/20181112095734.17979-1-ptikhomirov@virtuozzo.com
Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 29364670
CVE: CVE-2018-16862

Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

l2tp: fix reading optional fields of L2TPv3

Use pskb_may_pull() to make sure the optional fields are in skb linear
parts, so we can safely read them later.

It's easy to reproduce the issue with a net driver that supports paged
skb data. Just create an L2TPv3 over IP tunnel and then generate some
network traffic.
Once reproduced, rx err in /sys/kernel/debug/l2tp/tunnels will increase.

Changes in v4:
1. s/l2tp_v3_pull_opt/l2tp_v3_ensure_opt_in_linear/
2. s/tunnel->version != L2TP_HDR_VER_2/tunnel->version == L2TP_HDR_VER_3/
3. Add 'Fixes' in commit messages.

Changes in v3:
1. To keep consistency, move the code out of l2tp_recv_common.
2. Use "net" instead of "net-next", since this is a bug fix.

Changes in v2:
1. Only fix L2TPv3 to make code simple.
   To fix both L2TPv3 and L2TPv2, we'd better refactor l2tp_recv_common.
   It's complicated to do so.
2. Reloading pointers after pskb_may_pull

Fixes: f7faffa3ff8e ("l2tp: Add L2TPv3 protocol support")
Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4522a70db7aa5e77526a4079628578599821b193)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/l2tp/l2tp_core.c
net/l2tp/l2tp_ip.c
net/l2tp/l2tp_ip6.c
Commit 2b139e6b1ec8 ("l2tp: remove ->recv_payload_hook") is not in UEK5.

l2tp_core.h:
    s/l2tp_get_l2specific_len(session)/session->l2specific_len/ because
    62e7b6a57c7b ("l2tp: remove l2specific_len dependency in l2tp_core")
    is not in UEK4.

Orabug: 29368048

Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/packet: fix a race in packet_bind() and packet_notifier()

[ Upstream commit 15fe076edea787807a7cdc168df832544b58eba6 ]

syzbot reported crashes [1] and provided a C repro easing bug hunting.

When/if packet_do_bind() calls __unregister_prot_hook() and releases
po->bind_lock, another thread can run packet_notifier() and process an
NETDEV_UP event.

This calls register_prot_hook() and hooks again the socket right before
first thread is able to grab again po->bind_lock.

Fix this issue by temporarily setting po->num to 0, as suggested by
David Miller.

[1]
dev_remove_pack: ffff8801bf16fa80 not found
------------[ cut here ]------------
kernel BUG at net/core/dev.c:7945!  ( BUG_ON(!list_empty(&dev->ptype_all)); )
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
device syz0 entered promiscuous mode
CPU: 0 PID: 3161 Comm: syzkaller404108 Not tainted 4.14.0+ #190
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
task: ffff8801cc57a500 task.stack: ffff8801cc588000
RIP: 0010:netdev_run_todo+0x772/0xae0 net/core/dev.c:7945
RSP: 0018:ffff8801cc58f598 EFLAGS: 00010293
RAX: ffff8801cc57a500 RBX: dffffc0000000000 RCX: ffffffff841f75b2
RDX: 0000000000000000 RSI: 1ffff100398b1ede RDI: ffff8801bf1f8810
device syz0 entered promiscuous mode
RBP: ffff8801cc58f898 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801bf1f8cd8
R13: ffff8801cc58f870 R14: ffff8801bf1f8780 R15: ffff8801cc58f7f0
FS:  0000000001716880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020b13000 CR3: 0000000005e25000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106
tun_detach drivers/net/tun.c:670 [inline]
tun_chr_close+0x49/0x60 drivers/net/tun.c:2845
__fput+0x333/0x7f0 fs/file_table.c:210
____fput+0x15/0x20 fs/file_table.c:244
task_work_run+0x199/0x270 kernel/task_work.c:113
exit_task_work include/linux/task_work.h:22 [inline]
do_exit+0x9bb/0x1ae0 kernel/exit.c:865
do_group_exit+0x149/0x400 kernel/exit.c:968
SYSC_exit_group kernel/exit.c:979 [inline]
SyS_exit_group+0x1d/0x20 kernel/exit.c:977
entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x44ad19

Fixes: 30f7ea1c2b5f ("packet: race condition in packet_bind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Francesco Ruggeri <fruggeri@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
(cherry picked from commit 72b01bee76591d3741a87a689bf210d729d530f8)

Orabug: 29385593
CVE: CVE-2018-18559

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: verify the depth of extent tree in ext4_find_extent()

commit bc890a60247171294acc0bd67d211fa4b88d40ba upstream.

If there is a corrupted file system where the claimed depth of the
extent tree is -1, this can cause a massive buffer overrun leading to
sadness.

This addresses CVE-2018-10877.
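
The check described above can be sketched in user-space C. This is only an
illustration, not the code from the patch: check_extent_depth(), walk_levels()
and MAX_EXTENT_DEPTH are hypothetical names, and the bound of 5 is an
assumption made for the example. The point is that the claimed depth is
validated before it is ever used as an array index.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed upper bound on the extent tree depth (illustrative value). */
#define MAX_EXTENT_DEPTH 5

/* Hypothetical check: 0 if the claimed depth is usable, -1 if it is not. */
static int check_extent_depth(int depth)
{
    if (depth < 0 || depth > MAX_EXTENT_DEPTH)
        return -1;
    return 0;
}

/*
 * Hypothetical walker: visits one path slot per tree level, but only after
 * the claimed depth has been validated, so path[] is never indexed out of
 * bounds. Stores the number of visited levels in *visited.
 */
static int walk_levels(int claimed_depth, int *visited)
{
    int path[MAX_EXTENT_DEPTH + 1] = { 0 };
    int level;

    assert(visited != NULL);
    *visited = 0;

    if (check_extent_depth(claimed_depth) != 0)
        return -1;              /* treat as a corrupted file system */

    for (level = claimed_depth; level >= 0; level--) {
        path[level] = 1;        /* safe: 0 <= level <= MAX_EXTENT_DEPTH */
        (*visited)++;
    }
    return 0;
}
```

A claimed depth of -1 is rejected up front instead of being used to size or
index the path array.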

https://bugzilla.kernel.org/show_bug.cgi?id=199417

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d69a9df614fc68741efcb0fcc020f05caa99d668)

Orabug: 29396712
CVE: CVE-2018-10877

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

blk-mq: Do not invoke .queue_rq() for a stopped queue

The meaning of the BLK_MQ_S_STOPPED flag is "do not call
.queue_rq()". Hence modify blk_mq_make_request() such that requests
are queued instead of issued if a queue has been stopped.

Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 28766011

commit bc27c01b5c46d3bfec42c96537c7a3fae0bb2cc4 upstream

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
  - There are so many commits between the most recent uek4 block commit and
    this upstream commit, and blk_mq_make_request() in upstream commit is
    different with uek4. blk_mq_direct_issue_request() is not available.
  - The 3rd argument of blk_mq_insert_request() is set to false in this
    backport because there is no need to run the queue again when it is
    already stopped.

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

uek-rpm: use multi-threaded xz compression for rpms

By default kernel rpms use xz compression for all rpm files and
gzip compression for src.rpms. This commit changes compression type to xz
for all rpms produced, and enables multi-threaded compression for rpms. This
allows the build to use as many threads as there are CPU cores available.

Orabug: 29323635

Signed-off-by: Alex Burmashev <alexander.burmashev@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

uek-rpm: optimize find-requires usage

No deps are found for the debuginfo-common, doc and headers subpackages, so
Autoreq: no is now set for them. Also switch from /usr/lib/rpm/redhat/find-requires
to /usr/lib/rpm/rpmdeps --requires, since the latter is faster and less buggy.
Both changes noticeably speed up the kernel rpm build.

Orabug: 29323635

Signed-off-by: Alex Burmashev <alexander.burmashev@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

find-debuginfo.sh: backport parallel file processing

Use bundled find-debuginfo.sh instead of copying one during the build

rpm upstream commit 038bfe01796f751001e02de41c5d8678f511f366

find-debuginfo.sh: Split directory traversal and debuginfo extraction

This simplifies the handling of hardlinks a bit and allows a later patch
to parallelize the debuginfo extraction.

Signed-off-by: Michal Marek <mmarek@suse.com>
rpm upstream commit 1b338aa84d4c67fefa957352a028eaca1a45d1f6

find-debuginfo.sh: Process files in parallel

Add a -j <n> option, which, when used, will spawn <n> processes to do the
debuginfo extraction in parallel. A pipe is used to dispatch the files among
the processes.

Signed-off-by: Michal Marek <mmarek@suse.com>
Orabug: 29323635

Signed-off-by: Alex Burmashev <alexander.burmashev@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Todd Vierling <todd.vierling@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: SVM: Add MSR-based feature support for serializing LFENCE

In order to determine if LFENCE is a serializing instruction on AMD
processors, MSR 0xc0011029 (MSR_F10H_DECFG) must be read and the state
of bit 1 checked.  This patch will add support to allow a guest to
properly make this determination.

Add the MSR feature callback operation to svm.c and add MSR 0xc0011029
to the list of MSR-based features.  If LFENCE is serializing, then the
feature is supported, allowing the hypervisor to set the value of the
MSR that guest will see.  Support is also added to write (hypervisor only)
and read the MSR value for the guest.  A write by the guest will result in
a #GP.  A read by the guest will return the value as set by the host.  In
this way, the support to expose the feature to the guest is controlled by
the hypervisor.
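
The access rules described above can be modelled in a few lines of user-space
C. This is a sketch, not the svm.c implementation: struct decfg_model,
decfg_write(), decfg_read() and lfence_is_serializing() are hypothetical
names, and a #GP is represented by a plain return code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MSR number and bit position as given in the commit message. */
#define MSR_F10H_DECFG                       0xc0011029u
#define MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT  1

/* Hypothetical model of the per-guest MSR value kept by the hypervisor. */
struct decfg_model {
    uint64_t value;
};

enum msr_result { MSR_OK = 0, MSR_GP_FAULT = 1 };

/* True when bit 1 of a DE_CFG value is set, i.e. LFENCE is serializing. */
static bool lfence_is_serializing(uint64_t decfg)
{
    return ((decfg >> MSR_F10H_DECFG_LFENCE_SERIALIZE_BIT) & 1u) != 0;
}

/* Only a host-initiated write changes the value; a guest write faults. */
static enum msr_result decfg_write(struct decfg_model *m, uint64_t v,
                                   bool host_initiated)
{
    assert(m != NULL);
    if (!host_initiated)
        return MSR_GP_FAULT;
    m->value = v;
    return MSR_OK;
}

/* A guest read returns whatever the host last set. */
static uint64_t decfg_read(const struct decfg_model *m)
{
    assert(m != NULL);
    return m->value;
}
```

In this model the guest can only observe the bit, never change it, which is
how the hypervisor keeps control over whether the feature is exposed.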

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit d1d93fa90f1afa926cb060b7f78ab01a65705b4d)

Orabug: 29335274

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/svm.c
arch/x86/kvm/x86.c
Contextual

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Enable RANDOMIZE_BASE

UEK4 needs at least some degree of KASLR; to that end enable
RANDOMIZE_BASE and set RANDOMIZE_BASE_MAX_OFFSET to its default value
(0x40000000).

Orabug: 29305587

Signed-off-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

slub: make ->cpu_partial unsigned

/*
* cpu_partial determined the maximum number of objects
* kept in the per cpu partial lists of a processor.
*/

Can't be negative.

Orabug: 28620592

We can't reproduce the issue, this patch is expected to help in theory.

Link: http://lkml.kernel.org/r/20180305200730.15812-15-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

dtrace: support kernels built with RANDOMIZE_BASE

SDT probe addresses were being generated as absolute addresses which
breaks when the kernel may get relocated to a place other than the
default load address.

The solution is to generate the probe locations as an offset relative to
the _stext symbol in the .tmp_sdtinfo.S source file (generated at build
time), so that the actual addresses are processed as relocations when the
kernel boots.

This fix also optimizes the SDT info data (function and probe names) by
using de-duplication, since (especially with perf probes) many of those
strings are non-unique.
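
The idea of storing probe locations relative to _stext can be illustrated
with a small user-space C sketch. The struct and function names below are
hypothetical and do not come from the dtrace sources. In the real fix the
addresses are resolved as relocations at boot; the explicit addition here
only shows why an offset survives relocation while an absolute address does
not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a probe location stored as an offset from _stext. */
struct sdt_probe_info {
    uint64_t stext_offset;
};

/*
 * Resolve the probe address from wherever _stext ended up at boot.
 * The same record works for any load address, unlike an absolute address
 * baked in at build time.
 */
static uint64_t probe_runtime_addr(uint64_t runtime_stext,
                                   const struct sdt_probe_info *p)
{
    assert(p != NULL);
    return runtime_stext + p->stext_offset;
}
```

With an offset of 0x1234, the probe resolves to base + 0x1234 whether the
kernel is loaded at its default address or at a randomized one.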

Orabug: 29204005
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Nick Alcock <nick.alcock@oracle.com>
Tested-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Fix the AMD SSBD usage of the SPEC_CTRL MSR

On AMD, the presence of the MSR_SPEC_CTRL feature does not imply that the
SSBD mitigation support should use the SPEC_CTRL MSR. Other features could
have caused the MSR_SPEC_CTRL feature to be set, while a different SSBD
mitigation option is in place.

Update the SSBD support to check for the actual SSBD features that will
use the SPEC_CTRL MSR.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 6ac2f49edb1e ("x86/bugs: Add AMD's SPEC_CTRL MSR usage")
Link: http://lkml.kernel.org/r/20180702213602.29202.33151.stgit@tlendack-t1.amdoffice.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 612bc3b3d4be749f73a513a17d9b3ee1330d3487)

Orabug: 28870524
CVE: CVE-2018-3639

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
Different filename (bugs_64.c) and different context

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/bugs: Add AMD's SPEC_CTRL MSR usage

The AMD document outlining the SSBD handling
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
mentions that if CPUID 8000_0008.EBX[24] is set we should be using
the SPEC_CTRL MSR (0x48) over the VIRT SPEC_CTRL MSR (0xC001_011f)
for speculative store bypass disable.

This in effect means we should clear the X86_FEATURE_VIRT_SSBD
flag so that we would prefer the SPEC_CTRL MSR.

See the document titled:
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf

A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199889

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: andrew.cooper3@citrix.com
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20180601145921.9500-3-konrad.wilk@oracle.com
(cherry picked from commit 6ac2f49edb1ef5446089c7c660017732886d62d6)

Orabug: 28870524
CVE: CVE-2018-3639

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/common.c
arch/x86/kvm/cpuid.c
arch/x86/kvm/svm.c
The conflicts were due to different filenames (cpufeature.h, bugs_64.c) and different context.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/cpufeatures: rename X86_FEATURE_AMD_SSBD to X86_FEATURE_LS_CFG_SSBD

The commit 52817587e706 ('x86/cpufeatures: Disentangle SSBD enumeration') from
upstream disentangles SSBD enumeration. We did not backport that commit because
there was nothing to disentangle on UEK4: our cpufeature was already
synthetic.

That commit also renames X86_FEATURE_AMD_SSBD to X86_FEATURE_LS_CFG_SSBD. We
need this rename in order to not have conflicting cpu features while
backporting commit 6ac2f49edb1e ('x86/bugs: Add AMD's SPEC_CTRL MSR usage')
from upstream, which introduces use of the SPEC_CTRL MSR as the preferred
method.

Orabug: 28870524
CVE: CVE-2018-3639

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Make file credentials available to the seqfile interfaces

A lot of seqfile users seem to be using things like %pK that uses the
credentials of the current process, but that is actually completely
wrong for filesystem interfaces.

The unix semantics for permission checking files is to check permissions
at _open_ time, not at read or write time, and that is not just a small
detail: passing off stdin/stdout/stderr to a suid application and making
the actual IO happen in privileged context is a classic exploit
technique.

So if we want to be able to look at permissions at read time, we need to
use the file open credentials, not the current ones.  Normal file
accesses can just use "f_cred" (or any of the helper functions that do
that, like file_ns_capable()), but the seqfile interfaces do not have
any such options.

It turns out that seq_file _does_ save away the user_ns information of
the file, though.  Since user_ns is just part of the full credential
information, replace that special case with saving off the cred pointer
instead, and suddenly seq_file has all the permission information it
needs.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 34dbbcdbf63360661ff7bda6c5f52f99ac515f92)

Orabug: 29114879
CVE: CVE-2018-17972

Conflict:  Refactored include/linux/seq_file.h to include __GENKSYM__
and UEK_KABI_REPLACE() to pass check_kabi test.

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

proc: restrict kernel stack dumps to root

Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU.  That means that
the stack can change under the stack walker.  The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers.  This can cause exposure of
kernel stack contents to userspace.

Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents.  See the added comment for a longer
rationale.

There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails.  Therefore, I believe
that this change is unlikely to break things.  In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.

Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f50 ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit f8a00cef17206ecd1b30d3d9f99e10d9fa707aa7)

Orabug: 29114879
CVE: CVE-2018-17972

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Clean up retpoline code in bugs.c

Now that the minimal retpoline modes are removed, also remove
unnecessary checks to simplify retpoline code.

Orabug: 29211617

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86, modpost: Replace last remnants of RETPOLINE with CONFIG_RETPOLINE

Commit

4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")

replaced the RETPOLINE define with CONFIG_RETPOLINE checks. Remove the
remaining pieces.

[ bp: Massage commit message. ]

Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
Signed-off-by: WANG Chao <chao.wang@ucloud.cn>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: linux-kbuild@vger.kernel.org
Cc: srinivas.eeda@oracle.com
Cc: stable <stable@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20181210163725.95977-1-chao.wang@ucloud.cn
(cherry picked from commit e4f358916d528d479c3c12bd2fd03f2d5a576380)

Orabug: 29211617

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
include/linux/compiler-gcc.h
include/linux/module.h
UEK4 either implements the changes in different files
or it does not have the patches that introduce the
lines changed by this cherry-picked commit.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/build: Fix compiler support check for CONFIG_RETPOLINE

It is troublesome to add a diagnostic like this to the Makefile
parse stage because the top-level Makefile could be parsed with
a stale include/config/auto.conf.

Once you are hit by the error about non-retpoline compiler, the
compilation still breaks even after disabling CONFIG_RETPOLINE.

The easiest fix is to move this check to the "archprepare" like
this commit did:

829fe4aa9ac1 ("x86: Allow generating user-space headers without a compiler")

Reported-by: Meelis Roos <mroos@linux.ee>
Tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support")
Link: http://lkml.kernel.org/r/1543991239-18476-1-git-send-email-yamada.masahiro@socionext.com
Link: https://lkml.org/lkml/2018/12/4/206
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 25896d073d8a0403b07e6dec56f58e6c33678207)

Orabug: 29211617

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/Makefile
The archprepare rule is different in UEK and upstream makefiles

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/retpoline: Remove minimal retpoline support

Now that CONFIG_RETPOLINE hard depends on compiler support, there is no
reason to keep the minimal retpoline support around which only provided
basic protection in the assembly files.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <srinivas.eeda@oracle.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/f06f0a89-5587-45db-8ed2-0a9d6638d5c0@default
(cherry picked from commit ef014aae8f1cd2793e4e014bbb102bed53f852b7)

Orabug: 29211617

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
UEK4 has the corresponding code in bugs_64.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support

Since retpoline capable compilers are widely available, make
CONFIG_RETPOLINE hard depend on the compiler capability.

Break the build when CONFIG_RETPOLINE is enabled and the compiler does not
support it. Emit an error message in that case:

"arch/x86/Makefile:226: *** You are building kernel with non-retpoline
compiler, please update your compiler.. Stop."

[dwmw: Fail the build with non-retpoline compiler]

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Borislav Petkov <bp@suse.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: <srinivas.eeda@oracle.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/cca0cb20-f9e2-4094-840b-fb0f8810cd34@default
(cherry picked from commit 4cd24de3a0980bf3100c9dcb08ef65ca7c31af48)

Orabug: 29211617

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/Kconfig
arch/x86/include/asm/nospec-branch.h
Minor differences between UEK and upstream.
arch/x86/Makefile
Need to add line defining RETPOLINE_CFLAGS.
arch/x86/kernel/cpu/bugs.c
UEK4 has the corresponding code in bugs_64.c
scripts/Makefile.build
Commit e699314 (objtool: Add retpoline validation) has not
been ported to UEK4, nothing to change.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

nl80211: check for the required netlink attributes presence

nl80211_set_rekey_data() does not check if the required attributes
NL80211_REKEY_DATA_{REPLAY_CTR,KEK,KCK} are present when processing
NL80211_CMD_SET_REKEY_OFFLOAD request. This request can be issued by
users with CAP_NET_ADMIN privilege and may result in NULL dereference
and a system crash. Add a check for the required attributes presence.
This patch is based on the patch by bo Zhang.

This fixes CVE-2017-12153.

References: https://bugzilla.redhat.com/show_bug.cgi?id=1491046
Fixes: e5497d766ad ("cfg80211/nl80211: support GTK rekey offload")
Cc: <stable@vger.kernel.org> # v3.1-rc1
Reported-by: bo Zhang <zhangbo5891001@gmail.com>
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
(cherry picked from commit e785fa0a164aa11001cba931367c7f94ffaff888)

Orabug: 29245533
CVE: CVE-2017-12153

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: lpfc: Fix PT2PT PRLI reject (reapply patch)

[backport of 114e80db15039e248eb4e458559cef57737930a8]
From: rkennedy <dick.kennedy@avagotech.com>

Orabug: 29281346

lpfc cannot establish connection with targets that send PRLI in P2P
configurations.

If lpfc rejects a PRLI that is sent from a target, the target will not
resend and will reject the PRLI sent from the initiator.

[tv: original mistakenly applied in reverse, because change was already
present in the code at that point; this reapplies forwards]

Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com>
Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Reviewed-by: Fred Herard <fred.herard@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: congestion updates can be missed when kernel low on memory

The congestion updates are allocated with GFP_NOWAIT and can
fail under temporary memory pressure. Failed updates were not retried;
this change retries them until they are sent.

On receiving congestion updates, corrupt packet check failures
are not logged as warnings.

Orabug: 28425811

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/rds: ib: Fix endless RNR Retries caused by memory allocation failures

Temporary memory allocation failures may cause an RDS connection
to be stuck in an endless RNR (receiver Not Ready). Right around the
time the RDS connection becomes stuck, it reports these recv buffer
allocation failures:

rcuos/10: page allocation failure: order:2, mode:0x2
Call Trace:
<IRQ> [<ffffffff81698cc0>] dump_stack+0x63/0x83
[<ffffffff8118e59a>] warn_alloc_failed+0xea/0x140
[<ffffffff810b93fa>] ? select_idle_sibling+0x2a/0x120
[<ffffffff81191e09>] __alloc_pages_slowpath+0x409/0x760
[<ffffffff81192411>] __alloc_pages_nodemask+0x2b1/0x2d0
[<ffffffff810bae62>] ? check_preempt_wakeup+0x112/0x230
[<ffffffff811dc3af>] alloc_pages_current+0xaf/0x170
[<ffffffffa12e2090>] rds_page_remainder_alloc+0x60/0x2a4
[<ffffffffa0b1c0ac>] rds_ib_refill_one_frag+0x13c/0x200 [rds_rdma]
[<ffffffffa12994cd>] rds_ib_recv_refill_one+0x8d/0x220
[<ffffffffa0b1dfbf>] rds_ib_recv_refill+0x11f/0x340 [rds_rdma]
[<ffffffffa129989e>] rds_ib_recv_cqe_handler+0x23e/0x290
[<ffffffffa0b19326>] poll_cq+0x66/0xe0 [rds_rdma]
[<ffffffffa0b1945d>] rds_ib_rx+0xbd/0x210 [rds_rdma]
[<ffffffffa0b1964a>] rds_ib_tasklet_fn_recv+0x3a/0x50 [rds_rdma]
[<ffffffff81088361>] tasklet_action+0xb1/0xc0
[<ffffffff8108871a>] __do_softirq+0x10a/0x350
[<ffffffff8169f53c>] do_softirq_own_stack+0x1c/0x30
<EOI> [<ffffffff81088445>] do_softirq+0x55/0x60
[<ffffffff81088528>] __local_bh_enable_ip+0x88/0x90
[<ffffffff810e86d1>] rcu_nocb_kthread+0xf1/0x180
[<ffffffff810e85e0>] ? print_cpu_stall+0x170/0x170
[<ffffffff810e85e0>] ? print_cpu_stall+0x170/0x170
[<ffffffff810a465e>] kthread+0xce/0xf0
[<ffffffff810a4590>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff8169dda2>] ret_from_fork+0x42/0x70
[<ffffffff810a4590>] ? kthread_freezable_should_stop+0x70/0x70

We re-schedule recv buffer refiller on satisfying these conditions:

if (rds_conn_up(conn) &&
   (must_wake || (can_wait && ring_low)
              || rds_ib_ring_empty(&ic->i_recv_ring))) {
   queue_delayed_work(conn->c_wq, &conn->c_recv_w, 1);
}

This currently doesn't take into account memory allocation failures.

A bit later the memory pressure clears away.
But RDS does not refill receive buffers for that connection any more.
This is because the receiver is only woken up on the last packet of a
multi-packet message. But the last packet is never received, because the
recv queue becomes empty and we end up in the endless RNR Retry situation.
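
The re-schedule condition quoted above can be sketched as a pure predicate.
This is only an illustration under assumptions: should_requeue_refill() and
the alloc_failed flag are hypothetical names, and the commit message does not
spell out the exact form of the fix, only that allocation failures must be
taken into account.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate mirroring the condition quoted above, extended
 * with an alloc_failed flag (an assumption, not the actual patch).
 */
static bool should_requeue_refill(bool conn_up, bool must_wake, bool can_wait,
                                  bool ring_low, bool ring_empty,
                                  bool alloc_failed)
{
    if (!conn_up)
        return false;
    /* Original terms, plus a retry when the last refill could not allocate. */
    return must_wake || (can_wait && ring_low) || ring_empty || alloc_failed;
}
```

With such a flag, a refill that failed while the ring was merely low (and not
yet empty) would still be retried once memory pressure clears.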

Orabug: 28127993

Consultation with: Haakon Bugge

Reviewed-by: Yanjun Zhu <yanjun.zhu@oracle.com>
Reviewed-by: Haakon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: rds: fix excess initialization of the recv SGEs

In rds_ib_recv_init_ring(), an excess array element is incorrectly
initialized. This is not an OOB situation, as the sge array is
initialized to eight entries. With a fragment size of a maximum of 16KiB
and a page size of minimum 4KiB, then num_send_sge can at most become
five.
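
The bound of five can be checked with a few lines of arithmetic. The sketch
below assumes one SGE for the header plus one SGE per page of the fragment;
max_recv_sge() is a hypothetical helper, not a function from the RDS code.

```c
#include <stddef.h>

/*
 * Upper bound on SGEs per receive work request, assuming one SGE for the
 * header plus one per page of the fragment (an assumption for illustration).
 */
static size_t max_recv_sge(size_t frag_size, size_t page_size)
{
    size_t data_sges = (frag_size + page_size - 1) / page_size; /* round up */

    return data_sges + 1; /* +1 for the header SGE */
}
```

For a 16KiB fragment and 4KiB pages this gives 4 + 1 = 5, which fits in the
eight-entry array.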

Orabug: 29004503

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xhci: fix usb2 resume timing and races.

According to USB 2 specs ports need to signal resume for at least 20ms,
in practice even longer, before moving to U0 state.
Both host and devices can initiate resume.

On device initiated resume, a port status interrupt with the port in resume
state is issued. The interrupt handler tags a resume_done[port]
timestamp with current time + USB_RESUME_TIMEOUT, and kicks the roothub timer.
The root hub timer requests port status, finds the port in resume state,
checks if the resume_done[port] timestamp has passed, and sets the port to U0 state.

On host initiated resume, current code sets the port to resume state,
sleeps 20ms, and finally sets the port to U0 state. This should also
be changed to work in a similar way as the device initiated resume, with
timestamp tagging, but that is not yet tested and will be a separate
fix later.

There are a few issues with this approach:

1. A host initiated resume will also generate a resume event. The event
   handler will find the port in resume state, believe it's a device
   initiated resume, and act accordingly.

2. A port status request might cut the resume signalling short if a
   get_port_status request is handled during the host resume signalling.
   The port will be found in resume state. The timestamp is not set leading
   to time_after_eq(jiffies, timestamp) returning true, as timestamp = 0.
   get_port_status will proceed with moving the port to U0.

3. If an error, or anything else happens to the port during device
   initiated resume signalling it will leave all the device resume
   parameters hanging uncleared, preventing further suspend, returning
   -EBUSY, and cause the pm thread to busyloop trying to enter suspend.

Fix this by using the existing resuming_ports bitfield to indicate that
resume signalling timing is taken care of.
Check if resume_done[port] is set before using it for timestamp
comparison, and also clear out any resume signalling related variables
if the port is not in U0 or Resume state.
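
Issue 2 and the timestamp check can be illustrated in userspace. The sketch
below re-implements the wrap-safe comparison behind time_after_eq() and adds
a guard for an unset timestamp; resume_signalling_done() is a hypothetical
name, and the real fix also relies on the resuming_ports bitfield, which is
not modelled here.

```c
#include <stdbool.h>

/* Userspace copy of the wrap-safe comparison behind time_after_eq(). */
static bool time_after_eq_ul(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* Hypothetical guard: an unset (zero) timestamp never counts as elapsed. */
static bool resume_signalling_done(unsigned long now, unsigned long resume_done)
{
    if (resume_done == 0)
        return false;
    return time_after_eq_ul(now, resume_done);
}
```

Without the guard, a zero timestamp compares as already passed for typical
jiffies values, which is how the resume signalling could be cut short.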

This issue was discovered when a PM thread busylooped, trying to runtime
suspend the xhci USB 2 roothub on a Dell XPS.

Cc: stable <stable@vger.kernel.org>
Reported-by: Daniel J Blueman <daniel@quora.org>
Tested-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 29028940

(cherry picked from commit f69115fdbc1ac0718e7d19ad3caa3da2ecfe1c96)
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xhci: Fix a race in usb2 LPM resume, blocking U3 for usb2 devices

Clear device initiated resume variables once device is fully up and running
in U0 state.

Resume needs to be signaled for 20ms for usb2 devices before they can be
moved to U0 state.

An interrupt is triggered if a device initiates resume. As we handle the
event in interrupt context we can not sleep for 20ms, so we instead set
a resume flag, a timestamp, and start the roothub polling.

The roothub code will later move the port to U0 when it finds a port in
resume state with the resume flag set, and timestamp passed by 20ms.

A host initiated resume is however not done in interrupt context, and
host initiated resume code will directly signal resume, wait 20ms and then
move the port to U0.

These two codepaths can race. If we are in the middle of a host initiated
resume, while sleeping for 20ms, we may handle a port event and find the
port in resume state. The port event handling code will assume the resume
was device initiated and set the resume flag and timestamp.

Root hub code will however not catch the port in resume state again as the
host initiated resume code has already moved the port to U0.
The resume flag and timestamp will remain set for this port, preventing the
port from suspending again (LPM setting port to U3).

Fix this for now by always clearing the device initiated resume parameters
once the port is in U0.

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 29028940

(cherry picked from commit dad67d5f3d0efe01d38c6cebcb6698280e51927b)
Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/usb/host/xhci-hub.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registered

Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
could trigger a harmless false positive WARN_ON. Check the vma is
already registered before checking VM_MAYWRITE to shut off the false
positive warning.
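
The reordering can be sketched as follows. The flag values and the
check_vma() helper are hypothetical stand-ins for the kernel's vm_flags
handling; the point is only the order of the two tests.

```c
#include <stdbool.h>

/* Hypothetical flag bits standing in for the kernel's vm_flags values. */
#define SK_VM_MAYWRITE  0x1u
#define SK_VM_UFFD_REG  0x2u  /* vma is registered with a uffd */

struct unregister_check {
    bool skip;  /* vma not registered: nothing to unregister */
    bool warn;  /* registered vma unexpectedly lacks VM_MAYWRITE */
};

static struct unregister_check check_vma(unsigned int vm_flags)
{
    struct unregister_check r = { false, false };

    /* Test registration first, so unregistered vmas never reach the warning. */
    if (!(vm_flags & SK_VM_UFFD_REG)) {
        r.skip = true;
        return r;
    }
    r.warn = !(vm_flags & SK_VM_MAYWRITE);
    return r;
}
```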

Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com
Cc: <stable@vger.kernel.org>
Fixes: 29ec90660d68 ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 29163750
CVE: CVE-2018-18397

commit 01e881f5a1fca4677e82733061868c6d6ea05ca7 upstream

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/userfaultfd.c

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas

After the VMA to register the uffd onto is found, check that it has
VM_MAYWRITE set before allowing registration. This way we inherit all
common code checks before allowing to fill file holes in shmem and
hugetlbfs with UFFDIO_COPY.

The userfaultfd memory model is not applicable for readonly files unless
it's a MAP_PRIVATE.

Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
Fixes: ff62a3421044 ("hugetlb: implement memfd sealing")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Hugh Dickins <hughd@google.com>
Reported-by: Jann Horn <jannh@google.com>
Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
Cc: <stable@vger.kernel.org>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 29163750
CVE: CVE-2018-18397

commit 29ec90660d68bbdd69507c1c8b4e33aa299278b1 upstream

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/userfaultfd.c
mm/userfaultfd.c

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/apic/x2apic: set affinity of a single interrupt to one cpu

A customer wants to offline CPUs until only 2 per node remain. As a
result, an lpfc HBA can no longer work because no irq vectors are
available.
[   51.031812] IRQ 284 set affinity failed because there are no available vectors.  The device assigned to this IRQ is unstable.
[   51.031817] IRQ 285 set affinity failed because there are no available vectors.  The device assigned to this IRQ is unstable.
[   51.031822] IRQ 286 set affinity failed because there are no available vectors.  The device assigned to this IRQ is unstable.
[   51.031827] IRQ 287 set affinity failed because there are no available vectors.  The device assigned to this IRQ is unstable.

This was caused by cluster_vector_allocation_domain, which wants to set
the affinity of a single interrupt to multiple CPUs and needs
the same irq vector to be available on all of them. This is difficult
in the customer's case, where there are a lot of HBAs on node 0 and only
2 or 4 cpus online there.

Moreover, this feature has been discarded upstream:
https://lkml.org/lkml/2017/9/13/576
We disable this feature by setting just one cpu in retmask in
cluster_vector_allocation_domain.
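
Reducing a set of candidate CPUs to a single one can be sketched on a plain
bitmask. The commit message does not say which CPU is kept; choosing the
lowest set bit is an assumption, and single_cpu_mask() is a hypothetical
helper rather than the kernel's cpumask API.

```c
#include <stdint.h>

/*
 * Reduce a CPU bitmask (bit n = CPU n) to its lowest set bit.
 * Returns 0 for an empty mask.
 */
static uint64_t single_cpu_mask(uint64_t cpus)
{
    return cpus & (~cpus + 1);
}
```

With only one CPU in the mask, a vector needs to be free on that CPU alone
instead of on every CPU of the cluster.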

The customer that encountered this issue used RHCK. Since UEK4 has
the same code, post the same patch for UEK4.

Orabug: 29196396

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen/blkback: rework validate_io_op()

Rework many if statements in validate_io_op() into a switch statement.

Orabug: 29199843

Suggested-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen/blkback: optimize validate_io_op() to filter BLKIF_OP_RESERVED_1 operation

Instead of hardcoding operation = 4, BLKIF_OP_RESERVED_1 = 4 is defined in
the header file.
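
A minimal sketch of a switch-based check using a named constant instead of a
bare 4. The SK_ names and sk_op_is_valid() are hypothetical, only the value 4
for the reserved operation comes from the text above, and the sketch assumes
that "filter" means the reserved operation is rejected.

```c
#include <stdbool.h>

/* Illustrative operation codes; only the value 4 is taken from the text. */
enum sk_blkif_op {
    SK_OP_READ = 0,
    SK_OP_WRITE = 1,
    SK_OP_RESERVED_1 = 4,  /* named instead of a hardcoded 4 */
};

static bool sk_op_is_valid(int op)
{
    switch (op) {
    case SK_OP_READ:
    case SK_OP_WRITE:
        return true;
    case SK_OP_RESERVED_1:  /* explicitly filtered */
    default:                /* anything unknown is rejected as well */
        return false;
    }
}
```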

Orabug: 29199843

Suggested-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen/blkback: do not BUG() for invalid blkif_request from frontend

Upstream commit 0e367ae46503 ("xen/blkback: correctly respond to unknown,
non-native requests") fixed a bug to correctly respond to unknown,
non-native requests, e.g., BLKIF_OP_RESERVED_1 or BLKIF_OP_PACKET for
64-bit SLES 11 guests when using a 32-bit backend.

Although such fix is already in uek4, it is broken by commit f0af2f840606
("xen-blkback: move indirect req allocation out-of-line") that introduced
the BUG() again.

This patch removes the BUG() so that an invalid blkif_request from the
frontend can no longer panic the backend.

Orabug: 29199843

Fixes: f0af2f840606 ("xen-blkback: move indirect req allocation out-of-line")
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/rds: WARNING: at net/rds/recv.c:222 rds_recv_hs_exthdrs+0xf8/0x1e0

The stack trace looks as follows:

WARNING: at net/rds/recv.c:222 rds_recv_hs_exthdrs+0xf8/0x1e0 [rds]()
Call Trace:
dump_stack+0x63/0x81
warn_slowpath_common+0x8a/0xc0
warn_slowpath_null+0x1a/0x20
rds_recv_hs_exthdrs+0xf8/0x1e0 [rds]
rds_recv_local.isra.7+0x396/0x440 [rds]
rds_recv_incoming+0x2d8/0x3c0 [rds]
rds_ib_recv_cqe_handler+0x44f/0x6d0 [rds_rdma]
poll_rcq+0x7a/0xa0 [rds_rdma]
rds_ib_rx+0xa4/0x220 [rds_rdma]
rds_ib_tasklet_fn_recv+0x30/0x40 [rds_rdma]
...

commit 041dc3e4d3
("Backport multipath RDS from upstream to UEK4") treats an
incoming rds ping or rds pong differently if the local (in case of pong) or
sender's port (in case of ping) is 1 (RDS_FLAG_PROBE_PORT).

There is nothing stopping rds-ping from picking this port for its local side
since it does wildcard socket bind.

The fix is to check for t_mp_capable transport.

Orabug: 29201779

Reviewed-by: Haakon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-netback: wake up xenvif_dealloc_kthread when it should stop

The feature 'staging grant' changed the behaviour of
xenvif_zerocopy_callback() that queue->dealloc_prod may not increase during
the do-while loop because of 'staging grant'. As a result,
xenvif_skb_zerocopy_complete() would not wake up xenvif_dealloc_kthread
because (prod == queue->dealloc_prod).

This causes trouble when the xenvif_dealloc_kthread is requested to stop by
xenvif_disconnect(). When xenvif_dealloc_kthread is stopped while
inflight_packets is not 0, xenvif_dealloc_kthread would not exit until
inflight_packets becomes 0.

However, because of 'staging grant', xenvif_skb_zerocopy_complete() would
not wake up xenvif_dealloc_kthread() although inflight_packets is
decremented and has already become 0. As a result, xenvif_dealloc_kthread will
never wake up.

xenvif_skb_zerocopy_complete() should wake up xenvif_dealloc_kthread when
the latter is in the process of stopping.

Orabug: 29217927

Fixes: fdbb2e3659b3 ("xen-netback: use gref mappings for Tx requests")
Reported-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "xfs: remove nonblocking mode from xfs_vm_writepage"

This reverts commit 6e2de7d4578d4f6ae76979286de5c5ee8e91754a.

These commits very possibly cause a SIGBUS issue. (We can't verify
that in the customer's environment.) Revert them.

Orabug: 29279692

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "xfs: remove xfs_cancel_ioend"

This reverts commit c680537035066177fa845053354974f7245c02d8.

These commits very possibly cause a SIGBUS issue. (We can't verify
that in the customer's environment.) Revert them.

Orabug: 29279692

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "xfs: Introduce writeback context for writepages"

This reverts commit b104054b547e9034c5c7bf763d08e9803b5b58ed.

These commits very possibly cause a SIGBUS issue. (We can't verify
that in the customer's environment.) Revert them.

Orabug: 29279692

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "xfs: xfs_cluster_write is redundant"

This reverts commit e58eae1b82358f6df9a88b1312cac667b3d968db.

These commits very possibly cause a SIGBUS issue. (We can't verify
that in the customer's environment.) Revert them.

Orabug: 29279692

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "xfs: factor mapping out of xfs_do_writepage"

This reverts commit 40a82631dc131f1b7f61b2a2fe6351c382aaf04f.

These commits very possibly cause a SIGBUS issue. (We can't verify
that in the customer's environment.) Revert them.

Orabug: 29279692

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "xfs: don't chain ioends during writepage submission"

This reverts commit 34457adcadaf557febddb9f715368bbd5c3fd239.

These commits very possibly cause a SIGBUS issue. (We can't verify
that in the customer's environment.) Revert them.

Orabug: 29279692

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mstflint: Fix coding style issues - left with LINUX_VERSION_CODE

Description:
Issue: 1471556

Orabug: 28878697

(cherry picked from commit 30e70911bcc22ac77b13d537225d7499261caac8)
cherry-pick-repo=github.com/Mellanox/mstflint.git

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Reviewed-by: Aron Silverton <aron.silverton@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mstflint: Fix coding-style issues

Description:
Issue: 1471556

Orabug: 28878697

(cherry picked from commit d514e6f02dcd8436e864e8113fe010898be56d10)
cherry-pick-repo=github.com/Mellanox/mstflint.git

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Reviewed-by: Aron Silverton <aron.silverton@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mstflint: Fix errors found with checkpatch script

Description:
Issue: 1471556

Title: Fix compilation issue

Description:
Issue: N/A

Orabug: 28878697

(cherry picked from commit 8154be122d0f841208b787b728085c565710e0f7
and dfec3c77f977344d234c93704e59a5ca12832ab1)
cherry-pick-repo=github.com/Mellanox/mstflint.git

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'
Squashed two commits since the 1st commit has a compilation
issue which is fixed by the 2nd commit.

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Reviewed-by: Aron Silverton <aron.silverton@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Added support for 5th Gen devices in Secure Boot module and mtcr

Signed-off-by: Adham Masarwah <adham@mellanox.com>
Orabug: 28878697

(cherry picked from commit 4cbcf2923e05d74694fa2a5355960ca979ee8a97)
cherry-pick-repo=github.com/Mellanox/mstflint.git

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Reviewed-by: Aron Silverton <aron.silverton@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Fix typos in mst_kernel

Signed-off-by: Adham Masarwah <adham@mellanox.com>
Orabug: 28878697

(cherry picked from commit 5fd539b720c95b557f55aa6465fc220415d3dca4)
cherry-pick-repo=github.com/Mellanox/mstflint.git

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'

Signed-off-by: Qing Huang <qing.huang@oracle.com>
Reviewed-by: Aron Silverton <aron.silverton@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

bnxt_en: Report PCIe link properties with pcie_print_link_status()

Orabug: 28942099

Previously the driver used pcie_get_minimum_link() to warn when the NIC
is in a slot that can't supply as much bandwidth as the NIC could use.

pcie_get_minimum_link() can be misleading because it finds the slowest link
and the narrowest link (which may be different links) without considering
the total bandwidth of each link.  For a path with a 16 GT/s x1 link and a
2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of
bandwidth, not the true available bandwidth of about 1969 MB/s for a
16 GT/s x1 link.
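
The figures above can be reproduced with integer arithmetic. The helper below
is a hypothetical sketch, not the PCI core implementation; it assumes 8b/10b
encoding for 2.5 and 5 GT/s links and 128b/130b encoding for faster ones.

```c
#include <stdint.h>

/*
 * Usable bandwidth in MB/s for a link, given the per-lane rate in MT/s
 * and the lane count. Encoding overhead: 8b/10b up to 5 GT/s, 128b/130b
 * above that.
 */
static uint32_t pcie_link_mbytes_per_sec(uint32_t mt_per_sec, uint32_t lanes)
{
    uint64_t raw = (uint64_t)mt_per_sec * lanes;  /* Mb/s on the wire */
    uint64_t payload = (mt_per_sec <= 5000) ? raw * 8 / 10
                                            : raw * 128 / 130;

    return (uint32_t)(payload / 8);               /* bits to bytes */
}
```

Here pcie_link_mbytes_per_sec(2500, 1) yields 250 and
pcie_link_mbytes_per_sec(16000, 1) yields 1969, matching the numbers quoted
above, while the 2.5 GT/s x16 link alone would provide 4000 MB/s. The real
bottleneck is therefore the 16 GT/s x1 link, not the combination of the
slowest speed and the narrowest width.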

Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself.  This finds
the slowest link in the path to the device by computing the total bandwidth
of each link and compares that with the capabilities of the device.

The dmesg change is:

  - PCIe: Speed %s Width x%d
  + %u.%03u Gb/s available PCIe bandwidth (%s x%d link)

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
[backport of upstream commit af125b754e2f09e6061e65db8f4eda0f7730011d]

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
[backport of UEK5 commit 48b32a7f2b4dddafbf42cde882c3c84c556fb477]
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt_compat.c
drivers/net/ethernet/broadcom/bnxt/bnxt_compat.h

Signed-off-by: Brian Maly <brian.maly@oracle.com>

selinux: Perform both commoncap and selinux xattr checks

When selinux is loaded the relaxed permission checks for writing
security.capability are not honored, which keeps file capabilities
from being used in user namespaces.

Stephen Smalley <sds@tycho.nsa.gov> writes:
> Originally SELinux called the cap functions directly since there was no
> stacking support in the infrastructure and one had to manually stack a
> secondary module internally.  inode_setxattr and inode_removexattr
> however were special cases because the cap functions would check
> CAP_SYS_ADMIN for any non-capability attributes in the security.*
> namespace, and we don't want to impose that requirement on setting
> security.selinux.  Thus, we inlined the capabilities logic into the
> selinux hook functions and adapted it appropriately.

Now that the permission checks in commoncap have evolved this
inlining of their contents has become a problem.  So restructure
selinux_inode_removexattr, and selinux_inode_setxattr to call
both the corresponding cap_inode_ function and dentry_has_perm
when the attribute is not a selinux security xattr.   This ensures
the policies of both commoncap and selinux are enforced.

This results in smack and selinux having the same basic structure
for setxattr and removexattr.  Performing their own special permission
checks when it is their module's xattr being written to, and deferring
to commoncap when that is not the case.  Then finally performing their
generic module policy on all xattr writes.

This structure is fine when you only consider stacking with the
commoncap lsm, but it becomes a problem if two lsms that don't want
the commoncap security checks on their own attributes need to be
stacked.  This means there will need to be updates in the future as lsm
stacking is improved, but at least now the structure between smack and
selinux is common making the code easier to refactor.

This change also has the effect that selinux_inode_setotherxattr becomes
unnecessary so it is removed.

Fixes: 8db6c34f1dbc ("Introduce v3 namespaced file capabilities")
Fixes: 7bbf0e052b76 ("[PATCH] selinux merge")
Historical Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Paul Moore <paul@paul-moore.com>
(cherry picked from commit 6b240306ee1631587a87845127824df54a0a5abe)

Orabug: 28951521

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
security/selinux/hooks.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Introduce v3 namespaced file capabilities

Root in a non-initial user ns cannot be trusted to write a traditional
security.capability xattr.  If it were allowed to do so, then any
unprivileged user on the host could map his own uid to root in a private
namespace, write the xattr, and execute the file with privilege on the
host.

However supporting file capabilities in a user namespace is very
desirable.  Not doing so means that any programs designed to run with
limited privilege must continue to support other methods of gaining and
dropping privilege.  For instance a program installer must detect
whether file capabilities can be assigned, and assign them if so but set
setuid-root otherwise.  The program in turn must know how to drop
partial capabilities, and do so only if setuid-root.

This patch introduces v3 of the security.capability xattr.  It builds a
vfs_ns_cap_data struct by appending a uid_t rootid to struct
vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
namespace which mounted the filesystem, usually init_user_ns) of the
root id in whose namespaces the file capabilities may take effect.
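
The layout described above can be sketched with fixed-width types. The sk_
names are hypothetical; the kernel definitions use little-endian __le32
fields, which plain uint32_t only approximates here.

```c
#include <stdint.h>

#define SK_VFS_CAP_U32 2  /* two 32-bit words per capability set */

/* v2-style data: magic/flags word followed by permitted/inheritable pairs. */
struct sk_vfs_cap_data {
    uint32_t magic_etc;
    struct {
        uint32_t permitted;
        uint32_t inheritable;
    } data[SK_VFS_CAP_U32];
};

/* v3-style data: the same fields with the root uid appended. */
struct sk_vfs_ns_cap_data {
    uint32_t magic_etc;
    struct {
        uint32_t permitted;
        uint32_t inheritable;
    } data[SK_VFS_CAP_U32];
    uint32_t rootid;  /* uid of root in the namespace the caps apply to */
};
```

Since every member is 32 bits wide, the v3 structure is exactly four bytes
larger than the v2 one (24 versus 20 bytes).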

When a task asks to write a v2 security.capability xattr, if it is
privileged with respect to the userns which mounted the filesystem, then
nothing should change.  Otherwise, the kernel will transparently rewrite
the xattr as a v3 with the appropriate rootid.  This is done during the
execution of setxattr() to catch user-space-initiated capability writes.
Subsequently, any task executing the file which has the noted kuid as
its root uid, or which is in a descendent user_ns of such a user_ns,
will run the file with capabilities.

Similarly when asking to read file capabilities, a v3 capability will
be presented as v2 if it applies to the caller's namespace.

If a task writes a v3 security.capability, then it can provide a uid for
the xattr so long as the uid is valid in its own user namespace, and it
is privileged with CAP_SETFCAP over its namespace.  The kernel will
translate that rootid to an absolute uid, and write that to disk.  After
this, a task in the writer's namespace will not be able to use those
capabilities (unless rootid was 0), but a task in a namespace where the
given uid is root will.

Only a single security.capability xattr may exist at a time for a given
file.  A task may overwrite an existing xattr so long as it is
privileged over the inode.  Note this is a departure from previous
semantics, which required privilege to remove a security.capability
xattr.  This check can be re-added if deemed useful.

This allows a simple setxattr to work, allows tar/untar to work, and
allows us to tar in one namespace and untar in another while preserving
the capability, without risking leaking privilege into a parent
namespace.

Example using tar:

$ cp /bin/sleep sleepx
$ mkdir b1 b2
$ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
$ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
$ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
$ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
$ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
   b2/sleepx = cap_sys_admin+ep
# /opt/ltp/testcases/bin/getv3xattr b2/sleepx
   v3 xattr, rootid is 100001

A patch to linux-test-project adding a new set of tests for this
functionality is in the nsfscaps branch at github.com/hallyn/ltp

Changelog:
   Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
   Nov 07 2016: convert rootid from and to fs user_ns
   (From ebiederm: mar 28 2017)
     commoncap.c: fix typos - s/v4/v3
     get_vfs_caps_from_disk: clarify the fs_ns root access check
     nsfscaps: change the code split for cap_inode_setxattr()
   Apr 09 2017:
       don't return v3 cap for caps owned by current root.
      return a v2 cap for a true v2 cap in non-init ns
   Apr 18 2017:
      . Change the flow of fscap writing to support s_user_ns writing.
      . Remove refuse_fcap_overwrite().  The value of the previous
        xattr doesn't matter.
   Apr 24 2017:
      . incorporate Eric's incremental diff
      . move cap_convert_nscap to setxattr and simplify its usage
   May 8, 2017:
      . fix leaking dentry refcount in cap_inode_getsecurity

Signed-off-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 8db6c34f1dbc8e06aa016a9b829b06902c3e1340)

Orabug: 28951521

UEK4 does not support marking user namespace owner for a filesystem.
Adding that support requires cherrypicking below commit from mainline

6e4eab577a0cae15b3da9b888cff16fe57981b3e
("fs: Add user namespace member to struct super_block")

This would break KABI. So, in UEK4, the user namespace owner
for a super_block is always init_user_ns.

UEK4 also does not have the lsm hook framework which
was added to mainline by the following commit

b1d9e6b0646d0e5ee5d9050bd236b6c65d66faef
("LSM: Switch to lists of hooks")

So, this backport ignores the change in LSM_HOOK.

Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
security/commoncap.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: ib: Use a delay when reconnecting to the very same IP address

An RDS IB connection may be formed from the very same IB port using
HCA level internal loop-back. If this connection attempt is performed
after RDS has cleared the ARP cache of the same IP address, an ARP IB
multicast is sent out on the IPoIB interface.

If the above scenario is performed on IPoIB interfaces that are
members of an IB Limited Partition, the ARP multicast will be dropped
by the HCA port. A corresponding PKey Violation is counted and a
corresponding PKey Violation Trap is sent to the OpenSM, subject to
rate control.

Now, due to a bug in RDS connection management, where it was not
anticipated that the peers of a connection could actually be the very
same port and have the same IP address, the reconnect attempts happen
with zero delay.

This leads to about 7700 connection attempts per second, about
4400 PKey Violations per second, and 8500 ARP multicasts per second.

This commit reduces the reconnect rate to about one attempt per second.
This is because RDS uses exponential backoff to calculate the delay, which
will shortly end up at rds_sysctl_reconnect_max_jiffies, which by
default is HZ, in other words, a delay of one second after the first 10
reconnects.
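
The backoff arithmetic can be sketched as a capped doubling. The helper name
is hypothetical, and the starting delay of 1 jiffy and HZ = 1000 used in the
note below are assumptions chosen for illustration, not values stated in the
commit message.

```c
/*
 * Delay in jiffies after n doublings, starting from min_jiffies and
 * capped at max_jiffies.
 */
static unsigned long reconnect_delay_after(unsigned int n,
                                           unsigned long min_jiffies,
                                           unsigned long max_jiffies)
{
    unsigned long delay = min_jiffies;

    while (n-- > 0 && delay < max_jiffies) {
        delay *= 2;
        if (delay > max_jiffies)
            delay = max_jiffies;  /* clamp to the configured maximum */
    }
    return delay;
}
```

Under those assumptions the delay goes 1, 2, 4, ..., 512 and the tenth
doubling (1024) is clamped to 1000 jiffies, that is, one second.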

Orabug: 29138813

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Ka-cheong Poon <ka-cheong.poon@oracle.com>
---

v1 -> v2:
* Amended commit message as per Ka-Cheong's suggestions

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Change mincore() to count "mapped" pages rather than "cached" pages

The semantics of what "in core" means for the mincore() system call are
somewhat unclear, but Linux has always (since 2.3.52, which is when
mincore() was initially done) treated it as "page is available in page
cache" rather than "page is mapped in the mapping".

The problem with that traditional semantic is that it exposes a lot of
system cache state that it really probably shouldn't, and that users
shouldn't really even care about.

So let's try to avoid that information leak by simply changing the
semantics to be that mincore() counts actual mapped pages, not pages
that might be cheaply mapped if they were faulted (note the "might be"
part of the old semantics: being in the cache doesn't actually guarantee
that you can access them without IO anyway, since things like network
filesystems may have to revalidate the cache before use).

In many ways the old semantics were somewhat insane even aside from the
information leak issue.  From the very beginning (and that beginning is
a long time ago: 2.3.52 was released in March 2000, I think), the code
had a comment saying

  Later we can get more picky about what "in core" means precisely.

and this is that "later".  Admittedly it is much later than is really
comfortable.

NOTE! This is a real semantic change, and it is for example known to
change the output of "fincore", since that program literally does a
mmap without populating it, and then doing "mincore()" on that mapping
that doesn't actually have any pages in it.

I'm hoping that nobody actually has any workflow that cares, and the
info leak is real.

We may have to do something different if it turns out that people have
valid reasons to want the old semantics, and if we can limit the
information leak sanely.

Cc: Kevin Easton <kevin@guarana.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Masatake YAMATO <yamato@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 574823bfab82d9d8fa47f422778043fbb4b4f50e)
Orabug: 29187415
CVE: CVE-2019-5489
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: John Donnelly <John.P.Donnelly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mincore.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1

According to rfc5661 18.16.4,
"If EXCLUSIVE4_1 was used, the client determines the attributes
used for the verifier by comparing attrset with cva_attrs.attrmask;"

So, EXCLUSIVE4_1 also needs those bitmasks used to store the verifier.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Orabug: 29204157

(cherry picked from commit ead8fb8c24411722b92198b3dccd102a76cdd050)
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Bill Baker <Bill.Baker@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: update i_disksize when new eof exceeds it

Orabug: 28940828

This patch is a helper for back porting upstream commit 45d8ec4d9fd5
(ext4: update i_disksize if direct write past ondisk size), add a condition
to allow updating i_disksize through calling ext4_ind_direct_IO when the new
eof exceeds both i_size and i_disksize.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: update i_disksize if direct write past ondisk size

Currently in ext4 direct write path, we update i_disksize only when
new eof is greater than i_size, and don't update it even when new
eof is greater than i_disksize but less than i_size. This doesn't
work well with delalloc buffer write, which updates i_size and
i_disksize only when delalloc blocks are resolved (at writeback
time), the i_disksize from direct write can be lost if a previous
buffer write succeeded at write time but failed at writeback time,
then results in corrupted ondisk inode size.

Consider this case, first buffer write 4k data to a new file at
offset 16k with delayed allocation, then direct write 4k data to the
same file at offset 4k before delalloc blocks are resolved, which
doesn't update i_disksize because it writes within i_size(20k), but
the extent tree metadata has been committed in journal. Then
writeback of the delalloc blocks fails (due to device error etc.),
and i_size/i_disksize from buffer write can't be written to disk
(still zero). A subsequent umount/mount cycle recovers journal and
writes extent tree metadata from direct write to disk, but with
i_disksize being zero.

Fix it by updating i_disksize too in direct write path when new eof
is greater than i_disksize but less than i_size, so i_disksize is
always consistent with direct write.
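
The difference between the old and new update rule can be sketched with
a simplified model. The struct and function names below are
hypothetical and this is not the ext4 code; it only replays the 20k /
4k example from the scenario above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified view of the two sizes the commit message discusses. */
struct inode_sizes {
    uint64_t i_size;      /* in-memory file size */
    uint64_t i_disksize;  /* size recorded for the on-disk inode */
};

/* Old rule: only a write that extends i_size touches i_disksize. */
static bool old_rule_updates(const struct inode_sizes *in, uint64_t new_eof)
{
    return new_eof > in->i_size;
}

/* New rule: any direct write ending past i_disksize updates it. */
static bool new_rule_updates(const struct inode_sizes *in, uint64_t new_eof)
{
    return new_eof > in->i_disksize;
}

/* Apply a direct write under the new rule; returns the resulting i_disksize. */
static uint64_t apply_direct_write(struct inode_sizes *in, uint64_t offset,
                                   uint64_t len)
{
    uint64_t new_eof = offset + len;

    if (new_rule_updates(in, new_eof))
        in->i_disksize = new_eof;
    if (new_eof > in->i_size)
        in->i_size = new_eof;
    return in->i_disksize;
}
```

With i_size at 20k and i_disksize still zero, a 4k direct write at
offset 4k gives a new eof of 8k: the old rule skips the update, while
the new rule records 8k.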

This fixes occasional i_size corruption in fstests generic/475.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Orabug: 28940828

commit 45d8ec4d9fd5468c08f2ef0b2b132bb62dc81a3d upstream

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/indirect.c
code line mismatch

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: protect i_disksize update by i_data_sem in direct write path

i_disksize update should be protected by i_data_sem, by either taking
the lock explicitly or by using ext4_update_i_disksize() helper. But the
i_disksize updates in ext4_direct_IO_write() are not protected at all,
which may be racing with i_disksize updates in writeback path in
delalloc buffer write path.

This is found by code inspection, and I didn't hit any i_disksize
corruption due to this bug. Thanks to Jan Kara for catching this bug and
suggesting the fix!

Reported-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Orabug: 28940828

commit 73fdad00b208b139cf43f3163fbc0f67e4c6047c upstream

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/indirect.c
code line mismatch

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Ashish Samant <ashish.samant@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ALSA: usb-audio: Fix UAF decrement if card has no live interfaces in card.c

If a USB sound card reports 0 interfaces, an error condition is triggered
and the function usb_audio_probe errors out. In the error path, there was a
use-after-free vulnerability where the memory object of the card was first
freed, followed by a decrement of the number of active chips. Moving the
decrement (the atomic_dec) above the free fixes the UAF.
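
The ordering issue can be shown with a small user-space model, assuming
(as the description implies) that the counter lives inside the object
being freed. The struct and function names are hypothetical and C11
atomics stand in for the kernel's atomic_t.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical model: the counter is part of the object that gets freed. */
struct card_model {
    atomic_int active;
    int num_interfaces;
};

/*
 * Error path in the fixed order: drop the reference first, free second.
 * Returns the counter value observed right after the decrement.
 */
static int probe_error_path(struct card_model *card)
{
    int remaining = atomic_fetch_sub(&card->active, 1) - 1;

    if (card->num_interfaces == 0)
        free(card);   /* card must not be touched after this point */
    return remaining;
}
```

Swapping the two steps would make the decrement write into memory that
has already been returned to the allocator.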

[ The original problem was introduced in 3.1 kernel, while it was
  developed in a different form.  The Fixes tag below indicates the
  original commit but it doesn't mean that the patch is applicable
  cleanly. -- tiwai ]

Fixes: 362e4e49abe5 ("ALSA: usb-audio - clear chip->probing on error exit")
Reported-by: Hui Peng <benquike@gmail.com>
Reported-by: Mathias Payer <mathias.payer@nebelwelt.net>
Signed-off-by: Hui Peng <benquike@gmail.com>
Signed-off-by: Mathias Payer <mathias.payer@nebelwelt.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 29042981
CVE: CVE-2018-19824
(cherry picked from commit 5f8cf712582617d523120df67d392059eaf2fc4b)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ALSA: usb-audio: Replace probing flag with active refcount

We can use active refcount for preventing autopm during probe.

Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 29042981
CVE: CVE-2018-19824
(cherry picked from commit a6da499b76b1a75412f047ac388e9ffd69a5c55b)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ALSA: usb-audio: Avoid nested autoresume calls

After the recent fix of runtime PM for USB-audio driver, we got a
lockdep warning like:

  =============================================
  [ INFO: possible recursive locking detected ]
  4.2.0-rc8+ #61 Not tainted
  ---------------------------------------------
  pulseaudio/980 is trying to acquire lock:
   (&chip->shutdown_rwsem){.+.+.+}, at: [<ffffffffa0355dac>] snd_usb_autoresume+0x1d/0x52 [snd_usb_audio]
  but task is already holding lock:
   (&chip->shutdown_rwsem){.+.+.+}, at: [<ffffffffa0355dac>] snd_usb_autoresume+0x1d/0x52 [snd_usb_audio]

This comes from snd_usb_autoresume() invoking down_read() and it's
used in a nested way.  Although it's basically safe, per se (as these
are read locks), it's better to reduce such spurious warnings.

The read lock is needed to guarantee the execution of "shutdown"
(cleanup at disconnection) task after all concurrent tasks are
finished.  This can be implemented in another better way.

Also, the current check of chip->in_pm isn't good enough for
protecting the racy execution of multiple auto-resumes.

This patch rewrites the logic of snd_usb_autoresume() & co; namely,
- The recursive call of autopm is avoided by the new refcount,
  chip->active.  The chip->in_pm flag is removed accordingly.
- Instead of rwsem, another refcount, chip->usage_count, is introduced
  for tracking the period to delay the shutdown procedure.  At
  the last clear of this refcount, wake_up() to the shutdown waiter is
  called.
- The shutdown flag is replaced with shutdown atomic count; this is
  for reducing the lock.
- Two new helpers are introduced to simplify the management of these
  refcounts; snd_usb_lock_shutdown() increases the usage_count, checks
  the shutdown state, and does autoresume.  snd_usb_unlock_shutdown()
  does the opposite.  Most of the mixer and other code just needs this,
  and simply returns an error if it receives an error from lock.
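
The lock/unlock helper pair described in the list above might look
roughly like the following user-space model. It is a sketch under
assumptions: C11 atomics replace the kernel primitives, a counter
stands in for wake_up(), autoresume/autosuspend are left out, and the
-EIO return value is chosen for illustration.

```c
#include <errno.h>
#include <stdatomic.h>

struct chip_model {
    atomic_int usage_count;  /* callers currently inside a guarded section */
    atomic_int shutdown;     /* non-zero once disconnect has started */
    atomic_int wakeups;      /* stands in for wake_up() on the shutdown waiter */
};

/* Drop one usage reference; the last user notifies the shutdown waiter. */
static void chip_put(struct chip_model *chip)
{
    if (atomic_fetch_sub(&chip->usage_count, 1) == 1)
        atomic_fetch_add(&chip->wakeups, 1);
}

/* Take a usage reference, then refuse if shutdown is already in progress. */
static int chip_lock_shutdown(struct chip_model *chip)
{
    atomic_fetch_add(&chip->usage_count, 1);
    if (atomic_load(&chip->shutdown)) {
        chip_put(chip);
        return -EIO;
    }
    /* The real helper would also autoresume here; omitted in this model. */
    return 0;
}

/* Opposite of chip_lock_shutdown(). */
static void chip_unlock_shutdown(struct chip_model *chip)
{
    /* The real helper would also autosuspend here; omitted in this model. */
    chip_put(chip);
}
```

A caller brackets its hardware access with the two helpers and returns
the error unchanged when the lock call fails.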

Fixes: 9003ebb13f61 ('ALSA: usb-audio: Fix runtime PM unbalance')
Reported-and-tested-by: Alexnader Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 29042981
CVE: CVE-2018-19824
(cherry picked from commit 47ab154593827b1a8f0713a2b9dd445753d551d8)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Conflict:

sound/usb/mixer.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: validate that metadata blocks do not overlap superblock

A number of fuzzing failures seem to be caused by allocation bitmaps
or other metadata blocks being pointed at the superblock.

This can cause kernel BUG or WARNings once the superblock is
overwritten, so validate the group descriptor blocks to make sure this
doesn't happen.
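
A check of this kind could be sketched as below. The flattened
descriptor struct and the function name are hypothetical, and the range
test on the inode table is an assumption about how thorough such a
check might be, not a description of the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, flattened view of one group descriptor. */
struct group_desc_model {
    uint64_t block_bitmap;  /* block number of the block bitmap */
    uint64_t inode_bitmap;  /* block number of the inode bitmap */
    uint64_t inode_table;   /* first block of the inode table */
    uint64_t itb_blocks;    /* number of blocks in the inode table */
};

/* True if any metadata block of this group lands on the superblock. */
static bool group_desc_overlaps_sb(const struct group_desc_model *gd,
                                   uint64_t sb_block)
{
    if (gd->block_bitmap == sb_block || gd->inode_bitmap == sb_block)
        return true;
    return sb_block >= gd->inode_table &&
           sb_block - gd->inode_table < gd->itb_blocks;
}
```

A mount-time validator would reject the filesystem as corrupted when
this returns true for any group.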

Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 829fa70dddadf9dd041d62b82cd7cea63943899d)

Orabug: 29114440
CVE: CVE-2018-1094

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: update inline int ext4_has_metadata_csum(struct super_block *sb)

to include ext4_has_feature_metadata_csum(sb) check.

Orabug: 29114440
CVE: CVE-2018-1094

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: always initialize the crc32c checksum driver

The extended attribute code now uses the crc32c checksum for hashing
purposes, so we should just always initialize it. We also want
to prevent NULL pointer dereferences if one of the metadata checksum
features is enabled after the file system is originally mounted.

This issue has been assigned CVE-2018-1094.

https://bugzilla.kernel.org/show_bug.cgi?id=199183
https://bugzilla.redhat.com/show_bug.cgi?id=1560788

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
(cherry picked from commit a45403b51582a87872927a3e0fc0a389c26867f1)

Orabug: 29114440
CVE: CVE-2018-1094

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "bnxt_en: Reduce default rings on multi-port cards."

Orabug: 28687746

This reverts commit 143bdb401ce42631af3030f192c8fa6d148b9197.

This commit caused IRQs per dev to be reduced from 8 to 4 which resulted in TPCC throughput dropping by 18%.
Revert this commit so we have 8 IRQs per dev again.

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

mlx4_core: Disable P_Key Violation Traps

Exadata virt edition, actively using IB partitions, is exposed to
excessive P_Key Violation Traps being sent to the SM. This is close to
a DoS attack. In addition, the OpenSM logs are flooded with these
messages, hiding potential other log messages deemed important to
investigate customer issues.

In fw version 2.35.6312, the traps are disabled, while the P_Key
Violations are still counted.

This commit will conditionally disable the P_Key Violation Traps
subject to fw version.
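
One way to gate a feature on firmware version is sketched below. Both
the packing of major.minor.subminor into one integer and the reading of
2.35.6312 as a minimum version are assumptions made for the example;
the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pack major.minor.subminor so that integer comparison orders versions. */
static uint64_t fw_ver(uint64_t major, uint64_t minor, uint64_t subminor)
{
    return (major << 32) | (minor << 16) | subminor;
}

/* Assumed rule: firmware 2.35.6312 or newer allows disabling the traps. */
static bool can_disable_pkey_traps(uint64_t running_fw)
{
    return running_fw >= fw_ver(2, 35, 6312);
}
```

The packing only orders versions correctly while minor and subminor
each fit in 16 bits, which holds for 6312.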

Orabug: 27693633

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
---

v1 -> v2:
* Incorporated review comments from jch
* Made the disabling dependent on fw version

Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: RDS connection does not reconnect after CQ access violation error

The sequence that leads to this state is as follows.

1) First we see CQ error logged.

Sep 29 22:32:33 dm54cel14 kernel: [471472.784371] mlx4_core
0000:46:00.0: CQ access violation on CQN 000419 syndrome=0x2
vendor_error_syndrome=0x0

2) That is followed by the drop of the associated RDS connection.

Sep 29 22:32:33 dm54cel14 kernel: [471472.784403] RDS/IB: connection
<192.168.54.43,192.168.54.1,0> dropped due to 'qp event'

3) We don't get the WR_FLUSH_ERRs for the posted receive buffers after that.

4) RDS is stuck in rds_ib_conn_shutdown while shutting down that connection.

crash64> bt 62577
PID: 62577  TASK: ffff88143f045400  CPU: 4   COMMAND: "kworker/u224:1"
#0 [ffff8813663bbb58] __schedule at ffffffff816ab68b
#1 [ffff8813663bbbb0] schedule at ffffffff816abca7
#2 [ffff8813663bbbd0] schedule_timeout at ffffffff816aee71
#3 [ffff8813663bbc80] rds_ib_conn_shutdown at ffffffffa041f7d1 [rds_rdma]
#4 [ffff8813663bbd10] rds_conn_shutdown at ffffffffa03dc6e2 [rds]
#5 [ffff8813663bbdb0] rds_shutdown_worker at ffffffffa03e2699 [rds]
#6 [ffff8813663bbe00] process_one_work at ffffffff8109cda1
#7 [ffff8813663bbe50] worker_thread at ffffffff8109d92b
#8 [ffff8813663bbec0] kthread at ffffffff810a304b
#9 [ffff8813663bbf50] ret_from_fork at ffffffff816b0752
crash64>

It was stuck here in rds_ib_conn_shutdown forever:

                /* quiesce tx and rx completion before tearing down */
                while (!wait_event_timeout(rds_ib_ring_empty_wait,
                                rds_ib_ring_empty(&ic->i_recv_ring) &&
                                (atomic_read(&ic->i_signaled_sends) == 0),
                                msecs_to_jiffies(5000))) {

                        /* Try to reap pending RX completions every 5 secs */
                        if (!rds_ib_ring_empty(&ic->i_recv_ring)) {
                                spin_lock_bh(&ic->i_rx_lock);
                                rds_ib_rx(ic);
                                spin_unlock_bh(&ic->i_rx_lock);
                        }
                }

The recv ring was not empty.
w_alloc_ptr = 560
w_free_ptr  = 256

This is what Mellanox had to say:
When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will
generate Async event to notify this error, also the QPs that tries to access
this CQ will be put to error state but will not be flushed since we must not
post CQEs to a broken CQ. The QP that tries to access will also issue an
Async catas event.

In summary we cannot wait for any more WR_FLUSH_ERRs in that state.

Orabug: 28733324

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL

[ Based on a patch from Paolo Bonzini <pbonzini@redhat.com> ]

... basically doing exactly what we do for VMX:

- Passthrough SPEC_CTRL to guests (if enabled in guest CPUID)
- Save and restore SPEC_CTRL around VMExit and VMEntry only if the guest
actually used it.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: kvm@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/1517669783-20732-1-git-send-email-karahmed@amazon.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b2ac58f90540e39324e7a29a7ad471407ae0bf48)

Orabug: 28069548

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/svm.c
Contextual and also we dropped msr_write_intercepted because we do not use it
(we have other logic for IBRS usage). No changes to svm_vcpu_run() because we
support IBRS and we have other code in place.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL - reloaded

This commit is filling out the blanks that were missed in the backport
26a0cd21bb76 ("KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL") due to lack
of different interfaces. 26a0cd21bb76 ("KVM/VMX: Allow direct access to
MSR_IA32_SPEC_CTRL") is basically an incomplete cherry-pick from
d28b387fb74da95d69d2615732f50cceb38e9a4d.

Also added the interception of MSR_IA32_SPEC_CTRL and
MSR_IA32_PRED_CMD so that the get/set MSR handling makes sense.

Orabug: 28069548

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM/x86: Add IBPB support

The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.

Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.

IBPB helps mitigate against three potential attacks:

* Mitigate guests from being attacked by other guests.
  - This is addressed by issuing IBPB when we do a guest switch.

* Mitigate attacks from guest/ring3->host/ring3.
  These would require an IBPB during context switch in host, or after
  VMEXIT. The host process has two ways to mitigate
  - Either it can be compiled with retpoline
  - If it's going through context switch, and has set !dumpable then
    there is an IBPB in that path.
    (Tim's patch: https://patchwork.kernel.org/patch/10192871)
  - The case where after a VMEXIT you return back to Qemu might make
    Qemu attackable from guest when Qemu isn't compiled with retpoline.
  There are issues reported when doing IBPB on every VMEXIT that resulted
  in some tsc calibration woes in guest.

* Mitigate guest/ring0->host/ring0 attacks.
  When host kernel is using retpoline it is safe against these attacks.
  If host kernel isn't using retpoline we might need to do an IBPB flush on
  every VMEXIT.

Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.

* IBPB is issued only for SVM during svm_free_vcpu().
  VMX has a vmclear and SVM doesn't.  Follow discussion here:
  https://lkml.org/lkml/2018/1/15/146

Please refer to the following spec for more details on the enumeration
and control.

Refer here to get documentation about mitigations.

https://software.intel.com/en-us/side-channel-security-support

[peterz: rebase and changelog rewrite]
[karahmed: - rebase
           - vmx: expose PRED_CMD if guest has it in CPUID
           - svm: only pass through IBPB if guest has it in CPUID
           - vmx: support !cpu_has_vmx_msr_bitmap()
           - vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
        PRED_CMD is a write-only MSR]

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: kvm@vger.kernel.org
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com
Link: https://lkml.kernel.org/r/1517522386-18410-3-git-send-email-karahmed@amazon.de
(cherry picked from commit 15d45071523d89b3fb7372e2135fbd72f6af9506)

Orabug: 28069548

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/cpuid.c
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c

All the conflicts were contextual. Major differences in the code between UEK4
and upstream (also in UEK4 we only have the feature IBRS, not SPEC_CTRL). We
had to introduce guest_cpuid_has_* functions in cpuid.h for each feature. Also
moved defines in cpuid.h that were needed in cpuid.h and cpuid.c.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: x86: pass host_initiated to functions that read MSRs

SMBASE is only readable from SMM for the VCPU, but it must be always
accessible if userspace is accessing it. Thus, all functions that
read MSRs are changed to accept a struct msr_data; the host_initiated
and index fields are pre-initialized, while the data field is filled
on return.
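
The calling convention described above can be modelled in user space as
follows. The struct layout mirrors the three fields named in the text;
the vcpu model, the function name, and the return convention (0 for
success, 1 for refusal) are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_SMBASE_MODEL 0x9eu  /* MSR index used for this illustration */

struct msr_data_model {
    bool host_initiated;  /* in: true when userspace issues the read */
    uint32_t index;       /* in: which MSR to read */
    uint64_t data;        /* out: filled on success */
};

struct vcpu_model {
    bool in_smm;
    uint64_t smbase;
};

/* Returns 0 on success, 1 if the read must be refused. */
static int get_msr_model(const struct vcpu_model *vcpu,
                         struct msr_data_model *msr)
{
    switch (msr->index) {
    case MSR_SMBASE_MODEL:
        /* Guest reads are only allowed in SMM; userspace always may read. */
        if (!msr->host_initiated && !vcpu->in_smm)
            return 1;
        msr->data = vcpu->smbase;
        return 0;
    default:
        return 1;
    }
}
```

The caller fills in host_initiated and index, and reads data only when
the call reports success.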

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 609e36d372ad9329269e4a1467bd35311893d1d6)

Orabug: 28069548

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: VMX: make MSR bitmaps per-VCPU

Place the MSR bitmap in struct loaded_vmcs, and update it in place
every time the x2apic or APICv state can change. This is rare and
the loop can handle 64 MSRs per iteration, in a similar fashion as
nested_vmx_prepare_msr_bitmap.
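
A per-VCPU bitmap of this kind can be sketched in user space as shown
below. The struct and helper names are hypothetical; the layout assumed
is the usual 4 KiB VMX MSR bitmap (read-low at 0x000, read-high at
0x400, write-low at 0x800, write-high at 0xc00), and the sketch flips
one bit at a time rather than 64 MSRs per iteration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSR_BITMAP_BYTES 4096

enum msr_access { MSR_READ, MSR_WRITE };

/* Each VCPU model carries its own bitmap instead of sharing a global one. */
struct loaded_vmcs_model {
    uint8_t msr_bitmap[MSR_BITMAP_BYTES];
};

/* Start with every MSR access intercepted. */
static void vmcs_model_init(struct loaded_vmcs_model *v)
{
    memset(v->msr_bitmap, 0xff, sizeof(v->msr_bitmap));
}

/* Locate the bit for (msr, type); returns -1 if the bitmap cannot cover it. */
static int msr_bit_location(uint32_t msr, enum msr_access type,
                            size_t *byte, unsigned int *bit)
{
    size_t base = (type == MSR_WRITE) ? 0x800 : 0x000;

    if (msr >= 0xc0000000u && msr <= 0xc0001fffu) {
        base += 0x400;      /* high MSR range */
        msr &= 0x1fff;
    } else if (msr > 0x1fff) {
        return -1;
    }
    *byte = base + msr / 8;
    *bit = msr % 8;
    return 0;
}

/* Set or clear the intercept bit for one MSR in one VCPU's bitmap. */
static int msr_set_intercept(struct loaded_vmcs_model *v, uint32_t msr,
                             enum msr_access type, bool intercept)
{
    size_t byte;
    unsigned int bit;

    if (msr_bit_location(msr, type, &byte, &bit) != 0)
        return -1;
    if (intercept)
        v->msr_bitmap[byte] |= (uint8_t)(1u << bit);
    else
        v->msr_bitmap[byte] &= (uint8_t)~(1u << bit);
    return 0;
}

/* MSRs outside the bitmap's ranges are treated as always intercepted. */
static bool msr_is_intercepted(const struct loaded_vmcs_model *v, uint32_t msr,
                               enum msr_access type)
{
    size_t byte;
    unsigned int bit;

    if (msr_bit_location(msr, type, &byte, &bit) != 0)
        return true;
    return (v->msr_bitmap[byte] >> bit) & 1u;
}
```

Because each model owns its bitmap, passing SPEC_CTRL (0x48) through
for one guest leaves the intercept in place for every other guest.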

This prepares for choosing, on a per-VM basis, whether to intercept
the SPEC_CTRL and PRED_CMD MSRs.

Cc: stable@vger.kernel.org # prereq for Spectre mitigation
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry-picked from 904e14fb7cb96401a7dc803ca2863fd5ba32ffe6)

Orabug: 28069548

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual - different content. Also vmx_enable_intercept_for_msr was already
in UEK4 as part of commit 8d14695f9542e9e0195d6e41ddaa52c32322adf5. We just
changed the signature.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

KVM: VMX: introduce alloc_loaded_vmcs

Group together the calls to alloc_vmcs and loaded_vmcs_init. Soon we'll also
allocate an MSR bitmap there.

Cc: stable@vger.kernel.org # prereq for Spectre mitigation
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry-picked from f21f165ef922c2146cc5bdc620f542953c41714b)

Orabug: 28069548

Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual

Signed-off-by: Brian Maly <brian.maly@oracle.com>