Some code does not mind if the master is bond or team and treats them
the same, as generic LAG.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7be61833042e7757745345eedc7b0efee240c189) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Similar to other helpers, caller can use this to find out if device is
team port.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f7f019ee6d117de5007d0b10e7960696bbf111eb) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Similar to other helpers, caller can use this to find out if device is
team master.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c981e4213e9d2d4ec79501bd607722ec712742a2) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/team/team.c
include/linux/netdevice.h
Fred Herard [Mon, 11 Mar 2019 22:41:43 +0000 (18:41 -0400)]
scsi: scsi_transport_iscsi: redirect conn error to console
This commit changes "detected conn error" printk
log level from KERN_INFO to KERN_WARNING. This change
is made with the assumption that KERN_WARNING messages
are configured to be redirected to console. It is
particularly useful to have detected connection errors
redirected to console when using iscsi boot device as it
may give clues as to why the system appears to be hung.
Signed-off-by: Fred Herard <fred.herard@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Mridula Shastry [Wed, 6 Mar 2019 21:29:55 +0000 (13:29 -0800)]
Revert x86/apic/x2apic: set affinity of a single interrupt to one cpu
The commit 092aa78c11f0
("x86/apic/x2apic: set affinity of a single interrupt to one cpu")
was causing performance regression on block storage server on X5.
On OCI X5 server, they were not binding irqs to CPUs 1:1,
irq to cpu affinity was set to multiple cpus
(/proc/$irq/smp_affinity: 00,003ffff0,0003ffff, cpu0-17 and 36-53).
This is not the default behavior of bnxt_en. From bnxt_en,
driver, when NIC link is up, it sets irq affinity, OCI assumed that
most of the bnxt_en interrupts will go to cpu3.
After the patch "x86/apic/x2apic:
set affinity of a single interrupt to one cpu",
if we set irq to cpu 1:1, it works fine, but if we set irq affinity
to multiple cpus, it only sets irq_cfg->domain/cpumask to the first
online cpu which is on the cpu affinity list. With the current setting
which caused the perf issue, although /proc/$irq/smp_affinity is set
to multiple cpus, irq_cfg->domain cpumask only has cpu 0, this lead all
ens4f0-TxRx interrupts to route to cpu0, also iscsi target application
was being run on CPU0 during the testing which led to the performace issue.
The issue is no longer seen after the patch was reverted.
Orabug: 29449976 Signed-off-by: Mridula Shastry <mridula.c.shastry@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Dongli Zhang [Thu, 7 Mar 2019 03:19:32 +0000 (11:19 +0800)]
jiffies: use jiffies64_to_nsecs() to fix 100% steal usage for xen vcpu hotplug
[ Not relevant upstream, therefore no upstream commit. ]
To fix, use jiffies64_to_nsecs() directly instead of deriving the result
according to jiffies_to_usecs().
As the return type of jiffies_to_usecs() is 'unsigned int', when the return
value is more than the size of 'unsigned int', the leading 32 bits would be
discarded.
Suppose USEC_PER_SEC=1000000L and HZ=1000, below are the expected and
actual incorrect result of jiffies_to_usecs(0x7770ef70):
After xen vcpu hotplug and when the new vcpu steal clock is calculated for
the first time, the result of this_rq()->prev_steal_time in
steal_account_process_tick() would be far smaller than the expected
value, due to that jiffies_to_usecs() discards the leading 32 bits.
As a result, the diff between current steal and this_rq()->prev_steal_time
is always very large. Steal usage would become 100% when the initial steal
clock obtained from xen hypervisor is very large during xen vcpu hotplug,
that is, when the guest is already up for a long time.
The bug can be detected by doing the following:
* Boot xen guest with vcpus=2 and maxvcpus=4
* Leave the guest running for a month so that the initial steal clock for
the new vcpu would be very large
* Hotplug 2 extra vcpus
* The steal time of new vcpus in /proc/stat would increase abnormally and
sometimes steal usage in top can become 100%
This was incidentally fixed in the patch set starting by
commit 93825f2ec736 ("jiffies: Reuse TICK_NSEC instead of NSEC_PER_JIFFY")
and ended with
commit b672592f0221 ("sched/cputime: Remove generic asm headers").
Link: https://lkml.org/lkml/2019/2/28/1373 Suggested-by: Juergen Gross <jgross@suse.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Si-Wei Liu [Thu, 17 Jan 2019 23:44:48 +0000 (18:44 -0500)]
net_failover: delay taking over primary device to accommodate udevd renaming
There is an inherent race with udev rename in userspace due to the exposure
of two lower slave devices while kernel attempts to manage the creation
for failover bonding itself automatically. The existing userspace naming
logic in udevd was not specifically written for this in-kernel
automagic.
The clean fix for the problem is either to update the udevd to not try
rename the 3-netdev (ideally rename the device in a coordinated manner),
or to fix the kernel to hide the 2 lower devices which does not have to
be shown to userspace unless needed (1-netdev model).
However, our pursuance of 1-netdev model had not been acknowledged by
upstream, and there's no motivation in the systemd/udevd community at
this point to refactor the rename logic and make it work well with
3-netdev.
Hyper-V's netvsc mitigated this by postponing the VF's dev_open() to
allow a userspace thread to rename the device within a 100ms worth of
window. For the interim, we follow the same as done by netvsc to avoid
the renaming failure, until we move to the point where a clean solution
is available in upstream.
Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Acked-by: John Haxby <john.haxby@oracle.com> Reviewed-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Tested-by: Vijay Balakrishna <vijay.balakrishna@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
As previously reported (https://patchwork.kernel.org/patch/8642031/)
it's possible to call shrink_dentry_list with a large number of dentries
(> 10000). This, in turn, could trigger the softlockup detector and
possibly trigger a panic. In addition to the unmount path being
vulnerable to this scenario, at SuSE we've observed similar situation
happening during process exit on processes that touch a lot of dentries.
Here is an excerpt from a crash dump. The number after the colon are
the number of dentries on the list passed to shrink_dentry_list:
While some of the callers of shrink_dentry_list do use cond_resched,
this is not sufficient to prevent softlockups. So just move
cond_resched into shrink_dentry_list from its callers.
David said: I've found hundreds of occurrences of warnings that we emit
when need_resched stays set for a prolonged period of time with the
stack trace that is included in the change log.
Link: http://lkml.kernel.org/r/1521718946-31521-1-git-send-email-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32785c0539b7e96f77a14a4f4ab225712665a5a4) Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Conflicts:
fs/dcache.c Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
J. Bruce Fields [Tue, 16 Jan 2018 15:08:00 +0000 (10:08 -0500)]
NFS: commit direct writes even if they fail partially
If some of the WRITE calls making up an O_DIRECT write syscall fail,
we neglect to commit, even if some of the WRITEs succeed.
We also depend on the commit code to free the reference count on the
nfs_page taken in the "if (request_commit)" case at the end of
nfs_direct_write_completion(). The problem was originally noticed
because ENOSPC's encountered partway through a write would result in a
closed file being sillyrenamed when it should have been unlinked.
Signed-off-by: J. Bruce Fields <bfields@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
(cherry picked from commit 1b8d97b0a837beaf48a8449955b52c650a7114b4)
Orabug: 28212440 Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Reviewed-by: Calum Mackay <calum.mackay@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mukesh Kacker [Tue, 5 Feb 2019 00:21:28 +0000 (16:21 -0800)]
rds: update correct congestion map for loopback transport
Loopback transport delivers data directly to the destination
socket since destination is always local to the machine.
The connection data structure passed to an internal
initializer API rds_inc_init() is from send side (unlike the
usual call to APIs from receive path when receive side
connection data structure is passed).
This inconsistency causes an update of the incorrect
congestion map when marking destination port congested when
loopback transport is used (which is when one or both end(s)
of a RDS connection has an IP loopback address).
The fix it to ensure correct map is updated, that of the
destination IP regardless of delivery coming from send side
of loopback transport or receive side of other transports.
Theodore Ts'o [Thu, 14 Jun 2018 04:58:00 +0000 (00:58 -0400)]
ext4: only look at the bg_flags field if it is valid
The bg_flags field in the block group descripts is only valid if the
uninit_bg or metadata_csum feature is enabled. We were not
consistently looking at this field; fix this.
Also block group #0 must never have uninitialized allocation bitmaps,
or need to be zeroed, since that's where the root inode, and other
special inodes are set up. Check for these conditions and mark the
file system as corrupted if they are detected.
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/balloc.c
fs/ext4/ialloc.c
kernel-ueknano package provides kernel-uek with no version in it. So it
can match any kernel-uek version installed. This commit adds the version
that the rpm is built for in kernel-uek provides also.
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Christoph Paasch [Wed, 27 Sep 2017 00:38:50 +0000 (17:38 -0700)]
net: Set sk_prot_creator when cloning sockets to the right proto
sk->sk_prot and sk->sk_prot_creator can differ when the app uses
IPV6_ADDRFORM (transforming an IPv6-socket to an IPv4-one).
Which is why sk_prot_creator is there to make sure that sk_prot_free()
does the kmem_cache_free() on the right kmem_cache slab.
Now, if such a socket gets transformed back to a listening socket (using
connect() with AF_UNSPEC) we will allocate an IPv4 tcp_sock through
sk_clone_lock() when a new connection comes in. But sk_prot_creator will
still point to the IPv6 kmem_cache (as everything got copied in
sk_clone_lock()). When freeing, we will thus put this
memory back into the IPv6 kmem_cache although it was allocated in the
IPv4 cache. I have seen memory corruption happening because of this.
With slub-debugging and MEMCG_KMEM enabled this gives the warning
"cache_from_obj: Wrong slab cache. TCPv6 but object is from TCP"
A C-program to trigger this:
void main(void)
{
int fd = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
int new_fd, newest_fd, client_fd;
struct sockaddr_in6 bind_addr;
struct sockaddr_in bind_addr4, client_addr1, client_addr2;
struct sockaddr unsp;
int val;
As far as I can see, this bug has been there since the beginning of the
git-days.
Signed-off-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9d538fa60bad4f7b23193c89e843797a1cf71ef3)
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Regardless of whether the flex_bg feature is set, we should always
check to make sure the bits we are setting in the block bitmap are
within the block group bounds.
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
It's really bad when the allocation bitmaps and the inode table
overlap with the block group descriptors, since it causes random
corruption of the bg descriptors. So we really want to head those off
at the pass.
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Fred Herard <fred.herard@oracle.com> Reviewed-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Biggers [Fri, 8 Dec 2017 15:13:27 +0000 (15:13 +0000)]
KEYS: add missing permission check for request_key() destination
When the request_key() syscall is not passed a destination keyring, it
links the requested key (if constructed) into the "default" request-key
keyring. This should require Write permission to the keyring. However,
there is actually no permission check.
This can be abused to add keys to any keyring to which only Search
permission is granted. This is because Search permission allows joining
the keyring. keyctl_set_reqkey_keyring(KEY_REQKEY_DEFL_SESSION_KEYRING)
then will set the default request-key keyring to the session keyring.
Then, request_key() can be used to add keys to the keyring.
Both negatively and positively instantiated keys can be added using this
method. Adding negative keys is trivial. Adding a positive key is a
bit trickier. It requires that either /sbin/request-key positively
instantiates the key, or that another thread adds the key to the process
keyring at just the right time, such that request_key() misses it
initially but then finds it in construct_alloc_key().
Fix this bug by checking for Write permission to the keyring in
construct_get_dest_keyring() when the default keyring is being used.
We don't do the permission check for non-default keyrings because that
was already done by the earlier call to lookup_user_key(). Also,
request_key_and_link() is currently passed a 'struct key *' rather than
a key_ref_t, so the "possessed" bit is unavailable.
We also don't do the permission check for the "requestor keyring", to
continue to support the use case described by commit 8bbf4976b59f
("KEYS: Alter use of key instantiation link-to-keyring argument") where
/sbin/request-key recursively calls request_key() to add keys to the
original requestor's destination keyring. (I don't know of any users
who actually do that, though...)
Fixes: 3e30148c3d52 ("[PATCH] Keys: Make request-key create an authorisation key") Cc: <stable@vger.kernel.org> # v2.6.13+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 4dca6ea1d9432052afb06baf2e3ae78188a4410b)
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
security/keys/request_key.c
David Howells [Mon, 19 Oct 2015 10:20:28 +0000 (11:20 +0100)]
KEYS: Don't permit request_key() to construct a new keyring
If request_key() is used to find a keyring, only do the search part - don't
do the construction part if the keyring was not found by the search. We
don't really want keyrings in the negative instantiated state since the
rejected/negative instantiation error value in the payload is unioned with
keyring metadata.
Now the kernel gives an error:
request_key("keyring", "#selinux,bdekeyring", "keyring", KEY_SPEC_USER_SESSION_KEYRING) = -1 EPERM (Operation not permitted)
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Håkon Bugge [Wed, 6 Feb 2019 18:28:52 +0000 (19:28 +0100)]
mlx4_ib: Distribute completion vectors when zero is supplied
MAD packet sending/receiving is not properly virtualized in
CX-3. Hence, these are proxied through the PF driver. The proxying
uses UD QPs. The associated CQs are created with completion vector
zero, in anticipation that zero will return the least-used vector, as
per commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs").
However, this does not happen, and we see that only the first EQ is
used for these proxy QPs.
This leads to great imbalance in CPU processing, in particular during
fail-over and fail-back, when a large number of RDMA CM
requests/responses are proxied through the PF driver.
The current netpoll implementation in the bnxt_en driver has problems
that may miss TX completion events. bnxt_poll_work() in effect is
only handling at most 1 TX packet before exiting. In addition,
there may be in flight TX completions that ->poll() may miss even
after we fix bnxt_poll_work() to handle all visible TX completions.
netpoll may not call ->poll() again and HW may not generate IRQ
because the driver does not ARM the IRQ when the budget (0 for netpoll)
is reached.
We fix it by handling all TX completions and to always ARM the IRQ
when we exit ->poll() with 0 budget.
Also, the logic to ACK the completion ring in case it is almost filled
with TX completions need to be adjusted to take care of the 0 budget
case, as discussed with Eric Dumazet <edumazet@google.com>
Reported-by: Song Liu <songliubraving@fb.com> Reviewed-by: Song Liu <songliubraving@fb.com> Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 73f21c653f930f438d53eed29b5e4c65c8a0f906) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Fix bug in the error code path when bnxt_request_irq() returns failure.
bnxt_disable_napi() should not be called in this error path because
NAPI has not been enabled yet.
Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.") Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c58387ab1614f6d7fb9e244f214b61e7631421fc) Signed-off-by: Brian Maly <brian.maly@oracle.com>
A recent change to reduce delay granularity waiting for firmware
reponse has caused a regression. With a tighter delay loop,
the driver may see the beginning part of the response faster.
The original 5 usec delay to wait for the rest of the message
is not long enough and some messages are detected as invalid.
Increase the maximum wait time from 5 usec to 20 usec. Also, fix
the debug message that shows the total delay time for the response
when the message times out. With the new logic, the delay time
is not fixed per iteration of the loop, so we define a macro to
show the total delay time.
Fixes: 9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cc559c1ac250a6025bd4a9528e424b8da250655b) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
Testing with DIM enabled on older kernels indicated that firmware calls
were slower than expected. More detailed analysis indicated that the
default 25us delay was higher than necessary. Reducing the time spend in
usleep_range() for the first several calls would reduce the overall
latency of firmware calls on newer Intel processors.
Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9751e8e714872aa650b030e52a9fafbb694a3714) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.c
When open fails during ethtool -L ring change, for example, the driver
may crash at bnxt_free_irq() because bp->bnapi is NULL.
If we fail to allocate all the new rings, bnxt_open_nic() will free
all the memory including bp->bnapi. Subsequent call to bnxt_close_nic()
will try to dereference bp->bnapi in bnxt_free_irq().
Fix it by checking for !bp->bnapi in bnxt_free_irq().
Fixes: e5811b8c09df ("bnxt_en: Add IRQ remapping logic.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cb98526bf9b985866d648dbb9c983ba9eb59daba) Signed-off-by: Brian Maly <brian.maly@oracle.com>
During initialization, if we encounter errors, there is a code path that
calls bnxt_hwrm_vnic_set_tpa() with invalid VNIC ID. This may cause a
warning in firmware logs.
Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 3c4fe80b32c685bdc02b280814d0cfe80d441c72) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Calling bnxt_set_max_func_irqs() to modify the max IRQ count requested or
freed by the RDMA driver is flawed. The max IRQ count is checked when
re-initializing the IRQ vectors and this can happen multiple times
during ifup or ethtool -L. If the max IRQ is reduced and the RDMA
driver is operational, we may not initailize IRQs correctly. This
problem shows up on VFs with very small number of MSIX.
There is no other logic that relies on the IRQ count excluding the ones
used by RDMA. So we fix it by just removing the call to subtract or
add the IRQs used by RDMA.
Fixes: a588e4580a7e ("bnxt_en: Add interface to support RDMA driver.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 30f529473ec962102e8bcd33a6a04f1e1b490ae2) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt.h
drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
If all pages are deleted from the mapping by memory reclaim and also
moved to the cleancache:
__delete_from_page_cache
(no shadow case)
unaccount_page_cache_page
cleancache_put_page
page_cache_delete
mapping->nrpages -= nr
(nrpages becomes 0)
We don't clean the cleancache for an inode after final file truncation
(removal).
truncate_inode_pages_final
check (nrpages || nrexceptional) is false
no truncate_inode_pages
no cleancache_invalidate_inode(mapping)
These way when reading the new file created with same inode we may get
these trash leftover pages from cleancache and see wrong data instead of
the contents of the new file.
Fix it by always doing truncate_inode_pages which is already ready for
nrpages == 0 && nrexceptional == 0 case and just invalidates inode.
[akpm@linux-foundation.org: add comment, per Jan] Link: http://lkml.kernel.org/r/20181112095734.17979-1-ptikhomirov@virtuozzo.com Fixes: commit 91b0abe36a7b ("mm + fs: store shadow entries in page cache") Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 29364670
CVE: CVE-2018-16862
Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jacob Wen [Wed, 30 Jan 2019 06:55:14 +0000 (14:55 +0800)]
l2tp: fix reading optional fields of L2TPv3
Use pskb_may_pull() to make sure the optional fields are in skb linear
parts, so we can safely read them later.
It's easy to reproduce the issue with a net driver that supports paged
skb data. Just create a L2TPv3 over IP tunnel and then generates some
network traffic.
Once reproduced, rx err in /sys/kernel/debug/l2tp/tunnels will increase.
Changes in v4:
1. s/l2tp_v3_pull_opt/l2tp_v3_ensure_opt_in_linear/
2. s/tunnel->version != L2TP_HDR_VER_2/tunnel->version == L2TP_HDR_VER_3/
3. Add 'Fixes' in commit messages.
Changes in v3:
1. To keep consistency, move the code out of l2tp_recv_common.
2. Use "net" instead of "net-next", since this is a bug fix.
Changes in v2:
1. Only fix L2TPv3 to make code simple.
To fix both L2TPv3 and L2TPv2, we'd better refactor l2tp_recv_common.
It's complicated to do so.
2. Reloading pointers after pskb_may_pull
Fixes: f7faffa3ff8e ("l2tp: Add L2TPv3 protocol support") Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support") Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6") Signed-off-by: Jacob Wen <jian.w.wen@oracle.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4522a70db7aa5e77526a4079628578599821b193) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/l2tp/l2tp_core.c
net/l2tp/l2tp_ip.c
net/l2tp/l2tp_ip6.c
Commit 2b139e6b1ec8 ("l2tp: remove ->recv_payload_hook") is not in UEK5.
l2tp_core.h:
s/l2tp_get_l2specific_len(session)/session->l2specific_len/ due to 62e7b6a57c7b ("l2tp: remove l2specific_len dependency in l2tp_core")
is not in UEK4.
syzbot reported crashes [1] and provided a C repro easing bug hunting.
When/if packet_do_bind() calls __unregister_prot_hook() and releases
po->bind_lock, another thread can run packet_notifier() and process an
NETDEV_UP event.
This calls register_prot_hook() and hooks again the socket right before
first thread is able to grab again po->bind_lock.
Fixes this issue by temporarily setting po->num to 0, as suggested by
David Miller.
Fixes: 30f7ea1c2b5f ("packet: race condition in packet_bind") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Francesco Ruggeri <fruggeri@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
(cherry picked from commit 72b01bee76591d3741a87a689bf210d729d530f8)
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Bart Van Assche [Mon, 18 Feb 2019 02:55:59 +0000 (10:55 +0800)]
blk-mq: Do not invoke .queue_rq() for a stopped queue
The meaning of the BLK_MQ_S_STOPPED flag is "do not call
.queue_rq()". Hence modify blk_mq_make_request() such that requests
are queued instead of issued if a queue has been stopped.
Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 28766011
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
- There are so many commits between the most recent uek4 block commit and
this upstream commit, and blk_mq_make_request() in upstream commit is
different with uek4. blk_mq_direct_issue_request() is not available.
- The 3rd argument of blk_mq_insert_request() is set to false in this
backport because there is no need to run the queue again when it is
already stopped.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alexander Burmashev [Thu, 7 Feb 2019 15:49:31 +0000 (07:49 -0800)]
uek-rpm: use multi-threaded xz compression for rpms
By default kernel rpms use xz compression for all rpms files and
gzip compression for src.rpms. This commit changes compression type to xz
for all rpms produced, and enables multi-threaded compression for rpms. It
allows to use as many threads as there are CPU cores available.
Signed-off-by: Alex Burmashev <alexander.burmashev@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alexander Burmashev [Thu, 7 Feb 2019 15:45:10 +0000 (07:45 -0800)]
uek-rpm: optimize find-requires usage
There are no deps found for debuginfo-common, doc and headers subpackage, so
Autoreq: no is now set for them, also switch from /usr/lib/rpm/redhat/find-requires
usage to /usr/lib/rpm/rpmdeps --requires, since latter is faster and less buggy.
Both changes noticeably boost kernel rpm build time.
Signed-off-by: Alex Burmashev <alexander.burmashev@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Add a -j <n> option, which, when used, will spawn <n> processes to do the
debuginfo extraction in parallel. A pipe is used to dispatch the files among
the processes.
Signed-off-by: Michal Marek <mmarek@suse.com>
Orabug: 29323635
Signed-off-by: Alex Burmashev <alexander.burmashev@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Todd Vierling <todd.vierling@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Tom Lendacky [Fri, 23 Feb 2018 23:18:20 +0000 (00:18 +0100)]
KVM: SVM: Add MSR-based feature support for serializing LFENCE
In order to determine if LFENCE is a serializing instruction on AMD
processors, MSR 0xc0011029 (MSR_F10H_DECFG) must be read and the state
of bit 1 checked. This patch will add support to allow a guest to
properly make this determination.
Add the MSR feature callback operation to svm.c and add MSR 0xc0011029
to the list of MSR-based features. If LFENCE is serializing, then the
feature is supported, allowing the hypervisor to set the value of the
MSR that guest will see. Support is also added to write (hypervisor only)
and read the MSR value for the guest. A write by the guest will result in
a #GP. A read by the guest will return the value as set by the host. In
this way, the support to expose the feature to the guest is controlled by
the hypervisor.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
(cherry picked from commit d1d93fa90f1afa926cb060b7f78ab01a65705b4d)
Signed-off-by: John Haxby <john.haxby@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Kris Van Hees [Sun, 13 Jan 2019 03:10:01 +0000 (22:10 -0500)]
dtrace: support kernels built with RANDOMIZE_BASE
SDT probe addresses were being generated as absolute addresses which
breaks when the kernel may get relocated to a place other than the
default load address.
The solution is to generate the probe locations as an offset relative to
the _stext symbol in the .tmp_sdtinfo.S source file (generated at build
time), so that the actual addresses are processed as relocations when the
kernel boots.
This fix also optimizes the SDT info data (function and probe names) by
using de-duplication since especially with perf probes) many of those
strings are non-unique.
Orabug: 29204005 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Reviewed-by: Nick Alcock <nick.alcock@oracle.com> Tested-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Tom Lendacky [Mon, 2 Jul 2018 21:36:02 +0000 (16:36 -0500)]
x86/bugs: Fix the AMD SSBD usage of the SPEC_CTRL MSR
On AMD, the presence of the MSR_SPEC_CTRL feature does not imply that the
SSBD mitigation support should use the SPEC_CTRL MSR. Other features could
have caused the MSR_SPEC_CTRL feature to be set, while a different SSBD
mitigation option is in place.
Update the SSBD support to check for the actual SSBD features that will
use the SPEC_CTRL MSR.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 6ac2f49edb1e ("x86/bugs: Add AMD's SPEC_CTRL MSR usage") Link: http://lkml.kernel.org/r/20180702213602.29202.33151.stgit@tlendack-t1.amdoffice.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 612bc3b3d4be749f73a513a17d9b3ee1330d3487)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
Different filename (bugs_64.c) and different context
Konrad Rzeszutek Wilk [Fri, 1 Jun 2018 14:59:20 +0000 (10:59 -0400)]
x86/bugs: Add AMD's SPEC_CTRL MSR usage
The AMD document outlining the SSBD handling
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
mentions that if CPUID 8000_0008.EBX[24] is set we should be using
the SPEC_CTRL MSR (0x48) over the VIRT SPEC_CTRL MSR (0xC001_011f)
for speculative store bypass disable.
This in effect means we should clear the X86_FEATURE_VIRT_SSBD
flag so that we would prefer the SPEC_CTRL MSR.
See the document titled:
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199889
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Cc: kvm@vger.kernel.org Cc: KarimAllah Ahmed <karahmed@amazon.de> Cc: andrew.cooper3@citrix.com Cc: Joerg Roedel <joro@8bytes.org> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20180601145921.9500-3-konrad.wilk@oracle.com
(cherry picked from commit 6ac2f49edb1ef5446089c7c660017732886d62d6)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/common.c
arch/x86/kvm/cpuid.c
arch/x86/kvm/svm.c
The conflicts were due to different filenames (cpufeature.h, bugs_64.c) and different context.
Mihai Carabas [Wed, 7 Nov 2018 07:00:03 +0000 (09:00 +0200)]
x86/cpufeatures: rename X86_FEATURE_AMD_SSBD to X86_FEATURE_LS_CFG_SSBD
The commit 52817587e706 ('x86/cpufeatures: Disentangle SSBD enumeration') from
upstream disentangles SSBD enumeration. We did not backport that commit because
we did not have what to disentangle on UEK4. Our cpufeature was already
synthetic.
That commit also renames X86_FEATURE_AMD_SSBD to X86_FEATURE_LS_CFG_SSBD. We
need this rename in order to not have conflicting cpu features while
backporting commit 6ac2f49edb1e ('x86/bugs: Add AMD's SPEC_CTRL MSR usage')
from upstream which introduces SPEC_CTRL MSR, which will be the prefered
method.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Make file credentials available to the seqfile interfaces
A lot of seqfile users seem to be using things like %pK that uses the
credentials of the current process, but that is actually completely
wrong for filesystem interfaces.
The unix semantics for permission checking files is to check permissions
at _open_ time, not at read or write time, and that is not just a small
detail: passing off stdin/stdout/stderr to a suid application and making
the actual IO happen in privileged context is a classic exploit
technique.
So if we want to be able to look at permissions at read time, we need to
use the file open credentials, not the current ones. Normal file
accesses can just use "f_cred" (or any of the helper functions that do
that, like file_ns_capable()), but the seqfile interfaces do not have
any such options.
It turns out that seq_file _does_ save away the user_ns information of
the file, though. Since user_ns is just part of the full credential
information, replace that special case with saving off the cred pointer
instead, and suddenly seq_file has all the permission information it
needs.
Conflict: Refactored include/linux/seq_file.h to include __GENKSYM__
and UEK_KABI_REPLACE() to pass check_kabi test.
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jann Horn [Fri, 5 Oct 2018 22:51:58 +0000 (15:51 -0700)]
proc: restrict kernel stack dumps to root
Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU. That means that
the stack can change under the stack walker. The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers. This can cause exposure of
kernel stack contents to userspace.
Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents. See the added comment for a longer
rationale.
There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails. Therefore, I believe
that this change is unlikely to break things. In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.
Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com Fixes: 2ec220e27f50 ("proc: add /proc/*/stack") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Ken Chen <kenchen@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Laura Abbott <labbott@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit f8a00cef17206ecd1b30d3d9f99e10d9fa707aa7)
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
include/linux/compiler-gcc.h
include/linux/module.h
UEK4 either implements the changes in different files
or it does not have the patches that introduce the
lines changed by this cherry-picked commit.
Masahiro Yamada [Wed, 5 Dec 2018 06:27:19 +0000 (15:27 +0900)]
x86/build: Fix compiler support check for CONFIG_RETPOLINE
It is troublesome to add a diagnostic like this to the Makefile
parse stage because the top-level Makefile could be parsed with
a stale include/config/auto.conf.
Once you are hit by the error about non-retpoline compiler, the
compilation still breaks even after disabling CONFIG_RETPOLINE.
The easiest fix is to move this check to the "archprepare" like
this commit did:
829fe4aa9ac1 ("x86: Allow generating user-space headers without a compiler")
Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com> Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support") Link: http://lkml.kernel.org/r/1543991239-18476-1-git-send-email-yamada.masahiro@socionext.com Link: https://lkml.org/lkml/2018/12/4/206 Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 25896d073d8a0403b07e6dec56f58e6c33678207)
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/Makefile
The archprepare rule is different in UEK and upstream makefiles
Zhenzhong Duan [Fri, 2 Nov 2018 08:45:41 +0000 (01:45 -0700)]
x86/retpoline: Remove minimal retpoline support
Now that CONFIG_RETPOLINE hard depends on compiler support, there is no
reason to keep the minimal retpoline support around which only provided
basic protection in the assembly files.
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Borislav Petkov <bp@suse.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: <srinivas.eeda@oracle.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/f06f0a89-5587-45db-8ed2-0a9d6638d5c0@default
(cherry picked from commit ef014aae8f1cd2793e4e014bbb102bed53f852b7)
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
UEK4 has the corresponding code in bugs_64.c
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/Kconfig
arch/x86/include/asm/nospec-branch.h
Minor differences between UEK and upstream.
arch/x86/Makefile
Need to add line defining RETPOLINE_CFLAGS.
arch/x86/kernel/cpu/bugs.c
UEK4 has the corresponding code in bugs_64.c
scripts/Makefile.build
Commit e699314 (objtool: Add retpoline validation) has not
been ported to UEK4, nothing to change.
nl80211: check for the required netlink attributes presence
nl80211_set_rekey_data() does not check if the required attributes
NL80211_REKEY_DATA_{REPLAY_CTR,KEK,KCK} are present when processing
NL80211_CMD_SET_REKEY_OFFLOAD request. This request can be issued by
users with CAP_NET_ADMIN privilege and may result in NULL dereference
and a system crash. Add a check for the required attributes presence.
This patch is based on the patch by bo Zhang.
This fixes CVE-2017-12153.
References: https://bugzilla.redhat.com/show_bug.cgi?id=1491046 Fixes: e5497d766ad ("cfg80211/nl80211: support GTK rekey offload") Cc: <stable@vger.kernel.org> # v3.1-rc1 Reported-by: bo Zhang <zhangbo5891001@gmail.com> Signed-off-by: Vladis Dronov <vdronov@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
(cherry picked from commit e785fa0a164aa11001cba931367c7f94ffaff888)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
lpfc cannot establish connection with targets that send PRLI in P2P
configurations.
If lpfc rejects a PRLI that is sent from a target the target will not
resend and will reject the PRLI send from the initiator.
[tv: original mistakenly applied in reverse, because change was already
present in the code at that point; this reapplies forwards]
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Todd Vierling <todd.vierling@oracle.com> Reviewed-by: Fred Herard <fred.herard@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mukesh Kacker [Wed, 1 Aug 2018 18:37:01 +0000 (11:37 -0700)]
rds: congestion updates can be missed when kernel low on memory
The congestion updates are allocated under GFP_NOWAIT and can
fail under temporary memory pressure. These are not retried and
the update here retries them until sent.
On receiving congestion updates, corrupt packet check failures
are not logged as warnings.
Temporary memory allocation failures may cause an RDS connection
to be stuck in an endless RNR (receiver Not Ready). Right around the
time the RDS connection becomes stuck, it reports these recv buffer
allocation failures:
This currently doesn't take into account memory allocation failures.
A bit later the memory pressure clears away.
But RDS does not refill receive buffers for that connection any more.
This is because the receiver is only woken up on the last packet of a
multi-packet message. But the last packet is never received, because the
recv queue becomes empty and we end up in the endless RNR Retry situation.
Zhu Yanjun [Fri, 25 Jan 2019 02:14:52 +0000 (21:14 -0500)]
net: rds: fix excess initialization of the recv SGEs
In rds_ib_recv_init_ring(), an excess array element is incorrectly
initialized. This is not an OOB situation, as the sge array is
initialized to eight entries. With a fragment size of a maximum of 16KiB
and a page size of minimum 4KiB, then num_send_sge can at most become
five.
Mathias Nyman [Fri, 11 Dec 2015 12:38:06 +0000 (14:38 +0200)]
xhci: fix usb2 resume timing and races.
According to USB 2 specs ports need to signal resume for at least 20ms,
in practice even longer, before moving to U0 state.
Both host and devices can initiate resume.
On device initiated resume, a port status interrupt with the port in resume
state in issued. The interrupt handler tags a resume_done[port]
timestamp with current time + USB_RESUME_TIMEOUT, and kick roothub timer.
Root hub timer requests for port status, finds the port in resume state,
checks if resume_done[port] timestamp passed, and set port to U0 state.
On host initiated resume, current code sets the port to resume state,
sleep 20ms, and finally sets the port to U0 state. This should also
be changed to work in a similar way as the device initiated resume, with
timestamp tagging, but that is not yet tested and will be a separate
fix later.
There are a few issues with this approach
1. A host initiated resume will also generate a resume event. The event
handler will find the port in resume state, believe it's a device
initiated resume, and act accordingly.
2. A port status request might cut the resume signalling short if a
get_port_status request is handled during the host resume signalling.
The port will be found in resume state. The timestamp is not set leading
to time_after_eq(jiffies, timestamp) returning true, as timestamp = 0.
get_port_status will proceed with moving the port to U0.
3. If an error, or anything else happens to the port during device
initiated resume signalling it will leave all the device resume
parameters hanging uncleared, preventing further suspend, returning
-EBUSY, and cause the pm thread to busyloop trying to enter suspend.
Fix this by using the existing resuming_ports bitfield to indicate that
resume signalling timing is taken care of.
Check if the resume_done[port] is set before using it for timestamp
comparison, and also clear out any resume signalling related variables
if port is not in U0 or Resume state
This issue was discovered when a PM thread busylooped, trying to runtime
suspend the xhci USB 2 roothub on a Dell XPS
(cherry picked from commit f69115fdbc1ac0718e7d19ad3caa3da2ecfe1c96) Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mathias Nyman [Wed, 18 Nov 2015 08:48:22 +0000 (10:48 +0200)]
xhci: Fix a race in usb2 LPM resume, blocking U3 for usb2 devices
Clear device initiated resume variables once device is fully up and running
in U0 state.
Resume needs to be signaled for 20ms for usb2 devices before they can be
moved to U0 state.
An interrupt is triggered if a device initiates resume. As we handle the
event in interrupt context we can not sleep for 20ms, so we instead set
a resume flag, a timestamp, and start the roothub polling.
The roothub code will later move the port to U0 when it finds a port in
resume state with the resume flag set, and timestamp passed by 20ms.
A host initiated resume is however not done in interrupt context, and
host initiated resume code will directly signal resume, wait 20ms and then
move the port to U0.
These two codepaths can race, if we are in the middle of a host initated
resume, while sleeping for 20ms, we may handle a port event and find the
port in resume state. The port event handling code will assume the resume
was device initiated and set the resume flag and timestamp.
Root hub code will however not catch the port in resume state again as the
host initated resume code has already moved the port to U0.
The resume flag and timestamp will remain set for this port preventing port
from suspending again (LPM setting port to U3)
Fix this for now by always clearing the device initated resume parameters
once port is in U0
(cherry picked from commit dad67d5f3d0efe01d38c6cebcb6698280e51927b) Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/usb/host/xhci-hub.c
Andrea Arcangeli [Fri, 14 Dec 2018 22:17:17 +0000 (14:17 -0800)]
userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registered
Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
could trigger an harmless false positive WARN_ON. Check the vma is
already registered before checking VM_MAYWRITE to shut off the false
positive warning.
Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com Cc: <stable@vger.kernel.org> Fixes: 29ec90660d68 ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 29163750
CVE: CVE-2018-18397
Andrea Arcangeli [Fri, 30 Nov 2018 22:09:32 +0000 (14:09 -0800)]
userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas
After the VMA to register the uffd onto is found, check that it has
VM_MAYWRITE set before allowing registration. This way we inherit all
common code checks before allowing to fill file holes in shmem and
hugetlbfs with UFFDIO_COPY.
The userfaultfd memory model is not applicable for readonly files unless
it's a MAP_PRIVATE.
Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com Fixes: ff62a3421044 ("hugetlb: implement memfd sealing") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Hugh Dickins <hughd@google.com> Reported-by: Jann Horn <jannh@google.com> Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Cc: <stable@vger.kernel.org> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 29163750
CVE: CVE-2018-18397
Jianchao Wang [Wed, 9 Jan 2019 08:20:24 +0000 (03:20 -0500)]
x86/apic/x2apic: set affinity of a single interrupt to one cpu
Customer want to offline the cpus to 2 per node. And finally, a
lpfc HBA cannot work any more due to no available
irq vectors.
[ 51.031812] IRQ 284 set affinity failed because there are no available vectors. The device assigned to this IRQ is unstable.
[ 51.031817] IRQ 285 set affinity failed because there are no available vectors. The device assigned to this IRQ is unstable.
[ 51.031822] IRQ 286 set affinity failed because there are no available vectors. The device assigned to this IRQ is unstable.
[ 51.031827] IRQ 287 set affinity failed because there are no available vectors. The device assigned to this IRQ is unstable.
It was due to cluster_vector_allocation_domain which want to set
interrupt affinity of a single interrupt to multiple CPUs and need
a same irq vector to be available on multiple cpus. This is difficult
for customer's case where there are a lot of HBAs on node 0 and only
2 or 4 cpus online there.
And actually, this feature has been discarded by the upstream.
https://lkml.org/lkml/2017/9/13/576
We close this feature by just set one cpu in retmask in
cluster_vector_allocation_domain.
Customer that encountered this issue used RHCK, since UEK4 also
has the same code, post a same patch for UEK4
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang.oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Suggested-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Reviewed-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Dongli Zhang [Wed, 23 Jan 2019 07:47:33 +0000 (15:47 +0800)]
xen/blkback: do not BUG() for invalid blkif_request from frontend
Upstream commit 0e367ae46503 ("xen/blkback: correctly respond to unknown,
non-native requests") fixed a bug to correctly respond to unknown,
non-native requests, e.g., BLKIF_OP_RESERVED_1 or BLKIF_OP_PACKET for
64-bit SLES 11 guests when using a 32-bit backend.
Although such fix is already in uek4, it is broken by commit f0af2f840606
("xen-blkback: move indirect req allocation out-of-line") that introduced
the BUG() again.
This patch removes the BUG() to avoid panic backend by invalid
blkif_request from frontend.
commit 041dc3e4d3
("Backport multipath RDS from upstream to UEK4") treats an
incoming rds ping or rds pong differently if the local (in case of pong) or
sender's port (in case of ping) is 1 (RDS_FLAG_PROBE_PORT).
There is nothing stopping rds-ping from picking this port for it's local side
since it does wildcard socket bind.
Dongli Zhang [Mon, 21 Jan 2019 01:56:55 +0000 (09:56 +0800)]
xen-netback: wake up xenvif_dealloc_kthread when it should stop
The feature 'staging grant' changed the behaviour of
xenvif_zerocopy_callback() that queue->dealloc_prod may not increase during
the do-while loop because of 'staging grant'. As a result,
xenvif_skb_zerocopy_complete() would not wake up xenvif_dealloc_kthread
because (prod == queue->dealloc_prod).
This makes trouble when the xenvif_dealloc_kthread is requested to stop by
xenvif_disconnect(). When xenvif_dealloc_kthread is stopped while
inflight_packets is not 0, xenvif_dealloc_kthread would not exit until
inflight_packets becomes 0.
However, because of 'staging grant', xenvif_skb_zerocopy_complete() would
not wake up xenvif_dealloc_kthread() although inflight_packets is
decremented and already becomes 0. As a result, xenvif_dealloc_kthread will
never wakes up.
xenvif_skb_zerocopy_complete() should wake up xenvif_dealloc_kthread when
the latter is in the progress to stop.
Fixes: fdbb2e3659b3 ("xen-netback: use gref mappings for Tx requests") Reported-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'
Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Aron Silverton <aron.silverton@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'
Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Aron Silverton <aron.silverton@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'
Squashed two commits since the 1st commit has a compilation
issue which is fixed by the 2nd commit.
Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Aron Silverton <aron.silverton@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'
Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Aron Silverton <aron.silverton@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
Files are relocated from 'kernel' directory to
'drivers/net/ethernet/mellanox/mstflint_access'
Signed-off-by: Qing Huang <qing.huang@oracle.com> Reviewed-by: Aron Silverton <aron.silverton@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Previously the driver used pcie_get_minimum_link() to warn when the NIC
is in a slot that can't supply as much bandwidth as the NIC could use.
pcie_get_minimum_link() can be misleading because it finds the slowest link
and the narrowest link (which may be different links) without considering
the total bandwidth of each link. For a path with a 16 GT/s x1 link and a
2.5 GT/s x16 link, it returns 2.5 GT/s x1, which corresponds to 250 MB/s of
bandwidth, not the true available bandwidth of about 1969 MB/s for a
16 GT/s x1 link.
Use pcie_print_link_status() to report PCIe link speed and possible
limitations instead of implementing this in the driver itself. This finds
the slowest link in the path to the device by computing the total bandwidth
of each link and compares that with the capabilities of the device.
Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
[backport of UEK5 commit 48b32a7f2b4dddafbf42cde882c3c84c556fb477] Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/broadcom/bnxt/bnxt_compat.c
drivers/net/ethernet/broadcom/bnxt/bnxt_compat.h
Eric W. Biederman [Mon, 2 Oct 2017 14:38:20 +0000 (09:38 -0500)]
selinux: Perform both commoncap and selinux xattr checks
When selinux is loaded the relax permission checks for writing
security.capable are not honored. Which keeps file capabilities
from being used in user namespaces.
Stephen Smalley <sds@tycho.nsa.gov> writes:
> Originally SELinux called the cap functions directly since there was no
> stacking support in the infrastructure and one had to manually stack a
> secondary module internally. inode_setxattr and inode_removexattr
> however were special cases because the cap functions would check
> CAP_SYS_ADMIN for any non-capability attributes in the security.*
> namespace, and we don't want to impose that requirement on setting
> security.selinux. Thus, we inlined the capabilities logic into the
> selinux hook functions and adapted it appropriately.
Now that the permission checks in commoncap have evolved this
inlining of their contents has become a problem. So restructure
selinux_inode_removexattr, and selinux_inode_setxattr to call
both the corresponding cap_inode_ function and dentry_has_perm
when the attribute is not a selinux security xattr. This ensures
the policies of both commoncap and selinux are enforced.
This results in smack and selinux having the same basic structure
for setxattr and removexattr. Performing their own special permission
checks when it is their modules xattr being written to, and deferring
to commoncap when that is not the case. Then finally performing their
generic module policy on all xattr writes.
This structure is fine when you only consider stacking with the
commoncap lsm, but it becomes a problem if two lsms that don't want
the commoncap security checks on their own attributes need to be
stack. This means there will need to be updates in the future as lsm
stacking is improved, but at least now the structure between smack and
selinux is common making the code easier to refactor.
This change also has the effect that selinux_linux_setotherxattr becomes
unnecessary so it is removed.
Fixes: 8db6c34f1dbc ("Introduce v3 namespaced file capabilities") Fixes: 7bbf0e052b76 ("[PATCH] selinux merge")
Historical Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Paul Moore <paul@paul-moore.com>
(cherry picked from commit 6b240306ee1631587a87845127824df54a0a5abe)
Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
security/selinux/hooks.c
Serge E. Hallyn [Mon, 8 May 2017 18:11:56 +0000 (13:11 -0500)]
Introduce v3 namespaced file capabilities
Root in a non-initial user ns cannot be trusted to write a traditional
security.capability xattr. If it were allowed to do so, then any
unprivileged user on the host could map his own uid to root in a private
namespace, write the xattr, and execute the file with privilege on the
host.
However supporting file capabilities in a user namespace is very
desirable. Not doing so means that any programs designed to run with
limited privilege must continue to support other methods of gaining and
dropping privilege. For instance a program installer must detect
whether file capabilities can be assigned, and assign them if so but set
setuid-root otherwise. The program in turn must know how to drop
partial capabilities, and do so only if setuid-root.
This patch introduces v3 of the security.capability xattr. It builds a
vfs_ns_cap_data struct by appending a uid_t rootid to struct
vfs_cap_data. This is the absolute uid_t (that is, the uid_t in user
namespace which mounted the filesystem, usually init_user_ns) of the
root id in whose namespaces the file capabilities may take effect.
When a task asks to write a v2 security.capability xattr, if it is
privileged with respect to the userns which mounted the filesystem, then
nothing should change. Otherwise, the kernel will transparently rewrite
the xattr as a v3 with the appropriate rootid. This is done during the
execution of setxattr() to catch user-space-initiated capability writes.
Subsequently, any task executing the file which has the noted kuid as
its root uid, or which is in a descendent user_ns of such a user_ns,
will run the file with capabilities.
Similarly when asking to read file capabilities, a v3 capability will
be presented as v2 if it applies to the caller's namespace.
If a task writes a v3 security.capability, then it can provide a uid for
the xattr so long as the uid is valid in its own user namespace, and it
is privileged with CAP_SETFCAP over its namespace. The kernel will
translate that rootid to an absolute uid, and write that to disk. After
this, a task in the writer's namespace will not be able to use those
capabilities (unless rootid was 0), but a task in a namespace where the
given uid is root will.
Only a single security.capability xattr may exist at a time for a given
file. A task may overwrite an existing xattr so long as it is
privileged over the inode. Note this is a departure from previous
semantics, which required privilege to remove a security.capability
xattr. This check can be re-added if deemed useful.
This allows a simple setxattr to work, allows tar/untar to work, and
allows us to tar in one namespace and untar in another while preserving
the capability, without risking leaking privilege into a parent
namespace.
A patch to linux-test-project adding a new set of tests for this
functionality is in the nsfscaps branch at github.com/hallyn/ltp
Changelog:
Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
Nov 07 2016: convert rootid from and to fs user_ns
(From ebiederm: mar 28 2017)
commoncap.c: fix typos - s/v4/v3
get_vfs_caps_from_disk: clarify the fs_ns root access check
nsfscaps: change the code split for cap_inode_setxattr()
Apr 09 2017:
don't return v3 cap for caps owned by current root.
return a v2 cap for a true v2 cap in non-init ns
Apr 18 2017:
. Change the flow of fscap writing to support s_user_ns writing.
. Remove refuse_fcap_overwrite(). The value of the previous
xattr doesn't matter.
Apr 24 2017:
. incorporate Eric's incremental diff
. move cap_convert_nscap to setxattr and simplify its usage
May 8, 2017:
. fix leaking dentry refcount in cap_inode_getsecurity
Signed-off-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit 8db6c34f1dbc8e06aa016a9b829b06902c3e1340)
Signed-off-by: Gayatri Vasudevan <gayatri.vasudevan@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
security/commoncap.c
Håkon Bugge [Wed, 2 Jan 2019 13:59:35 +0000 (14:59 +0100)]
rds: ib: Use a delay when reconnecting to the very same IP address
An RDS IB connection may be formed from the very same IB port using
HCA level internal loop-back. If this connection attempt is performed
after RDS has cleared the ARP cache of the same IP address, an ARP IB
multicast is sent out on the IPoIB interface.
If the above scenario is performed on IPoIB interfaces that are
members of an IB Limited Partition, the ARP multicast will be dropped
by the HCA port. A corresponding PKey Violation is counted and a
corresponding PKey Violation Trap is sent to the OpenSM, subject to
rate control.
Now, due to a bug in RDS connection management, where it was not
anticipated that the peers of a connection could actually be the very
same port and have the same IP address, the reconnect attempts happens
with zero delay.
This leads to about 7700 connection attempts per second, about
4400 PKey Violations per second, and 8500 ARP multicasts per second.
This commit reduces the reconnect rate down to one second. This
because the RDS uses exponential backoff to calculate the delay, which
will shortly end up at rds_sysctl_reconnect_max_jiffies, which by
default is HZ, in other words, a delay at one second after the 10
first reconnects.
Linus Torvalds [Sun, 6 Jan 2019 01:50:59 +0000 (17:50 -0800)]
Change mincore() to count "mapped" pages rather than "cached" pages
The semantics of what "in core" means for the mincore() system call are
somewhat unclear, but Linux has always (since 2.3.52, which is when
mincore() was initially done) treated it as "page is available in page
cache" rather than "page is mapped in the mapping".
The problem with that traditional semantic is that it exposes a lot of
system cache state that it really probably shouldn't, and that users
shouldn't really even care about.
So let's try to avoid that information leak by simply changing the
semantics to be that mincore() counts actual mapped pages, not pages
that might be cheaply mapped if they were faulted (note the "might be"
part of the old semantics: being in the cache doesn't actually guarantee
that you can access them without IO anyway, since things like network
filesystems may have to revalidate the cache before use).
In many ways the old semantics were somewhat insane even aside from the
information leak issue. From the very beginning (and that beginning is
a long time ago: 2.3.52 was released in March 2000, I think), the code
had a comment saying
Later we can get more picky about what "in core" means precisely.
and this is that "later". Admittedly it is much later than is really
comfortable.
NOTE! This is a real semantic change, and it is for example known to
change the output of "fincore", since that program literally does a
mmmap without populating it, and then doing "mincore()" on that mapping
that doesn't actually have any pages in it.
I'm hoping that nobody actually has any workflow that cares, and the
info leak is real.
We may have to do something different if it turns out that people have
valid reasons to want the old semantics, and if we can limit the
information leak sanely.
Cc: Kevin Easton <kevin@guarana.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Masatake YAMATO <yamato@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 574823bfab82d9d8fa47f422778043fbb4b4f50e)
Orabug: 29187415
CVE: CVE-2019-5489 Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: John Donnelly <John.P.Donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mincore.c
Kinglong Mee [Thu, 30 Jul 2015 13:55:02 +0000 (21:55 +0800)]
NFSD: Set the attributes used to store the verifier for EXCLUSIVE4_1
According to rfc5661 18.16.4,
"If EXCLUSIVE4_1 was used, the client determines the attributes
used for the verifier by comparing attrset with cva_attrs.attrmask;"
So, EXCLUSIVE4_1 also needs those bitmask used to store the verifier.
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Orabug: 29204157
(cherry picked from commit ead8fb8c24411722b92198b3dccd102a76cdd050) Signed-off-by: Calum Mackay <calum.mackay@oracle.com> Reviewed-by: Bill Baker <Bill.Baker@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This patch is a helper for back porting upstream commit 45d8ec4d9fd5
(ext4: update i_disksize if direct write past ondisk size), add a condition
to allow updating i_disksize through calling ext4_ind_direct_IO when the new
eof exceeds both i_size and i_disksize.
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Ashish Samant <ashish.samant@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eryu Guan [Thu, 22 Mar 2018 15:44:59 +0000 (11:44 -0400)]
ext4: update i_disksize if direct write past ondisk size
Currently in ext4 direct write path, we update i_disksize only when
new eof is greater than i_size, and don't update it even when new
eof is greater than i_disksize but less than i_size. This doesn't
work well with delalloc buffer write, which updates i_size and
i_disksize only when delalloc blocks are resolved (at writeback
time), the i_disksize from direct write can be lost if a previous
buffer write succeeded at write time but failed at writeback time,
then results in corrupted ondisk inode size.
Consider this case, first buffer write 4k data to a new file at
offset 16k with delayed allocation, then direct write 4k data to the
same file at offset 4k before delalloc blocks are resolved, which
doesn't update i_disksize because it writes within i_size(20k), but
the extent tree metadata has been committed in journal. Then
writeback of the delalloc blocks fails (due to device error etc.),
and i_size/i_disksize from buffer write can't be written to disk
(still zero). A subsequent umount/mount cycle recovers journal and
writes extent tree metadata from direct write to disk, but with
i_disksize being zero.
Fix it by updating i_disksize too in direct write path when new eof
is greater than i_disksize but less than i_size, so i_disksize is
always consistent with direct write.
This fixes occasional i_size corruption in fstests generic/475.
Eryu Guan [Thu, 22 Mar 2018 15:41:25 +0000 (11:41 -0400)]
ext4: protect i_disksize update by i_data_sem in direct write path
i_disksize update should be protected by i_data_sem, by either taking
the lock explicitly or by using ext4_update_i_disksize() helper. But the
i_disksize updates in ext4_direct_IO_write() are not protected at all,
which may be racing with i_disksize updates in writeback path in
delalloc buffer write path.
This is found by code inspection, and I didn't hit any i_disksize
corruption due to this bug. Thanks to Jan Kara for catching this bug and
suggesting the fix!
Reported-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
Orabug: 28940828
Hui Peng [Mon, 3 Dec 2018 15:09:34 +0000 (16:09 +0100)]
ALSA: usb-audio: Fix UAF decrement if card has no live interfaces in card.c
If a USB sound card reports 0 interfaces, an error condition is triggered
and the function usb_audio_probe errors out. In the error path, there was a
use-after-free vulnerability where the memory object of the card was first
freed, followed by a decrement of the number of active chips. Moving the
decrement above the atomic_dec fixes the UAF.
[ The original problem was introduced in 3.1 kernel, while it was
developed in a different form. The Fixes tag below indicates the
original commit but it doesn't mean that the patch is applicable
cleanly. -- tiwai ]
Fixes: 362e4e49abe5 ("ALSA: usb-audio - clear chip->probing on error exit") Reported-by: Hui Peng <benquike@gmail.com> Reported-by: Mathias Payer <mathias.payer@nebelwelt.net> Signed-off-by: Hui Peng <benquike@gmail.com> Signed-off-by: Mathias Payer <mathias.payer@nebelwelt.net> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 29042981
CVE: CVE-2018-19824
(cherry picked from commit 5f8cf712582617d523120df67d392059eaf2fc4b) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Takashi Iwai [Wed, 26 Aug 2015 08:20:59 +0000 (10:20 +0200)]
ALSA: usb-audio: Replace probing flag with active refcount
We can use active refcount for preventing autopm during probe.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 29042981
CVE: CVE-2018-19824
(cherry picked from commit a6da499b76b1a75412f047ac388e9ffd69a5c55b) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Takashi Iwai [Tue, 25 Aug 2015 14:09:00 +0000 (16:09 +0200)]
ALSA: usb-audio: Avoid nested autoresume calls
After the recent fix of runtime PM for USB-audio driver, we got a
lockdep warning like:
=============================================
[ INFO: possible recursive locking detected ]
4.2.0-rc8+ #61 Not tainted
---------------------------------------------
pulseaudio/980 is trying to acquire lock:
(&chip->shutdown_rwsem){.+.+.+}, at: [<ffffffffa0355dac>] snd_usb_autoresume+0x1d/0x52 [snd_usb_audio]
but task is already holding lock:
(&chip->shutdown_rwsem){.+.+.+}, at: [<ffffffffa0355dac>] snd_usb_autoresume+0x1d/0x52 [snd_usb_audio]
This comes from snd_usb_autoresume() invoking down_read() and it's
used in a nested way. Although it's basically safe, per se (as these
are read locks), it's better to reduce such spurious warnings.
The read lock is needed to guarantee the execution of "shutdown"
(cleanup at disconnection) task after all concurrent tasks are
finished. This can be implemented in another better way.
Also, the current check of chip->in_pm isn't good enough for
protecting the racy execution of multiple auto-resumes.
This patch rewrites the logic of snd_usb_autoresume() & co; namely,
- The recursive call of autopm is avoided by the new refcount,
chip->active. The chip->in_pm flag is removed accordingly.
- Instead of rwsem, another refcount, chip->usage_count, is introduced
for tracking the period to delay the shutdown procedure. At
the last clear of this refcount, wake_up() to the shutdown waiter is
called.
- The shutdown flag is replaced with shutdown atomic count; this is
for reducing the lock.
- Two new helpers are introduced to simplify the management of these
refcounts; snd_usb_lock_shutdown() increases the usage_count, checks
the shutdown state, and does autoresume. snd_usb_unlock_shutdown()
does the opposite. Most of mixer and other codes just need this,
and simply returns an error if it receives an error from lock.
Fixes: 9003ebb13f61 ('ALSA: usb-audio: Fix runtime PM unbalance') Reported-and-tested-by: Alexnader Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 29042981
CVE: CVE-2018-19824
(cherry picked from commit 47ab154593827b1a8f0713a2b9dd445753d551d8) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Brian Maly <brian.maly@oracle.com>
Conflict:
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Theodore Ts'o [Fri, 30 Mar 2018 02:10:31 +0000 (22:10 -0400)]
ext4: always initialize the crc32c checksum driver
The extended attribute code now uses the crc32c checksum for hashing
purposes, so we should just always always initialize it. We also want
to prevent NULL pointer dereferences if one of the metadata checksum
features is enabled after the file sytsem is originally mounted.
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This commit caused IRQs per dev to be reduced from 8 to 4 which resulted in TPCC throughput dropping by 18%.
Revert this commit so we have 8 IRQs per dev again.
Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Håkon Bugge [Thu, 4 Oct 2018 11:04:38 +0000 (13:04 +0200)]
mlx4_core: Disable P_Key Violation Traps
Exadata virt edition, actively using IB partitions, is exposed to
excessive P_Key Violation Traps being sent to the SM. This is close to
a DoS attack. In addition, the OpenSM logs are flooded with these
messages, hiding potential other log messages deemed important to
investigate customer issues.
In fw version 2.35.6312, the traps are disabled, still counting the
P-Key Violations.
This commit will conditionally disable the P_Key Violation Traps
subject to fw version.
It was stuck here in rds_ib_conn_shutdown for ever:
/* quiesce tx and rx completion before tearing down */
while (!wait_event_timeout(rds_ib_ring_empty_wait,
rds_ib_ring_empty(&ic->i_recv_ring) &&
(atomic_read(&ic->i_signaled_sends) == 0),
msecs_to_jiffies(5000))) {
/* Try to reap pending RX completions every 5 secs */
if (!rds_ib_ring_empty(&ic->i_recv_ring)) {
spin_lock_bh(&ic->i_rx_lock);
rds_ib_rx(ic);
spin_unlock_bh(&ic->i_rx_lock);
}
}
The recv ring was not empty.
w_alloc_ptr = 560
w_free_ptr = 256
This is what Mellanox had to say:
When CQ moves to error (e.g. due to CQ Overrun, CQ Access violation) FW will
generate Async event to notify this error, also the QPs that tries to access
this CQ will be put to error state but will not be flushed since we must not
post CQEs to a broken CQ. The QP that tries to access will also issue an
Async catas event.
In summary we cannot wait for any more WR_FLUSH_ERRs in that state.
KarimAllah Ahmed [Sat, 3 Feb 2018 14:56:23 +0000 (15:56 +0100)]
KVM/SVM: Allow direct access to MSR_IA32_SPEC_CTRL
[ Based on a patch from Paolo Bonzini <pbonzini@redhat.com> ]
... basically doing exactly what we do for VMX:
- Passthrough SPEC_CTRL to guests (if enabled in guest CPUID)
- Save and restore SPEC_CTRL around VMExit and VMEntry only if the guest
actually used it.
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: kvm@vger.kernel.org Cc: Dave Hansen <dave.hansen@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ashok Raj <ashok.raj@intel.com> Link: https://lkml.kernel.org/r/1517669783-20732-1-git-send-email-karahmed@amazon.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b2ac58f90540e39324e7a29a7ad471407ae0bf48)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/svm.c
Contextual and also we dropped msr_write_intercepted because we do not use it
(we have other logic for IBRS usage). No changes to svm_vcpu_run() because we
support IBRS and we have other code in place.
Mihai Carabas [Fri, 7 Dec 2018 13:09:51 +0000 (15:09 +0200)]
KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL - reloaded
This commit is filling out the blanks that were missed in the backport 26a0cd21bb76 ("KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL") due to lack
of different interfaces. 26a0cd21bb76 ("KVM/VMX: Allow direct access to
MSR_IA32_SPEC_CTRL") is basically an incomplet cherry-pick from d28b387fb74da95d69d2615732f50cceb38e9a4d.
Also added the interception of MSR_IA32_SPEC_CTRL and
MSR_IA32_PRED_CMD in order for the get/set MSR handling to have a sense.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ashok Raj [Thu, 1 Feb 2018 21:59:43 +0000 (22:59 +0100)]
KVM/x86: Add IBPB support
The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.
Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.
IBPB helps mitigate against three potential attacks:
* Mitigate guests from being attacked by other guests.
- This is addressed by issing IBPB when we do a guest switch.
* Mitigate attacks from guest/ring3->host/ring3.
These would require a IBPB during context switch in host, or after
VMEXIT. The host process has two ways to mitigate
- Either it can be compiled with retpoline
- If its going through context switch, and has set !dumpable then
there is a IBPB in that path.
(Tim's patch: https://patchwork.kernel.org/patch/10192871)
- The case where after a VMEXIT you return back to Qemu might make
Qemu attackable from guest when Qemu isn't compiled with retpoline.
There are issues reported when doing IBPB on every VMEXIT that resulted
in some tsc calibration woes in guest.
* Mitigate guest/ring0->host/ring0 attacks.
When host kernel is using retpoline it is safe against these attacks.
If host kernel isn't using retpoline we might need to do a IBPB flush on
every VMEXIT.
Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.
* IBPB is issued only for SVM during svm_free_vcpu().
VMX has a vmclear and SVM doesn't. Follow discussion here:
https://lkml.org/lkml/2018/1/15/146
Please refer to the following spec for more details on the enumeration
and control.
Refer here to get documentation about mitigations.
[peterz: rebase and changelog rewrite]
[karahmed: - rebase
- vmx: expose PRED_CMD if guest has it in CPUID
- svm: only pass through IBPB if guest has it in CPUID
- vmx: support !cpu_has_vmx_msr_bitmap()]
- vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
PRED_CMD is a write-only MSR]
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: kvm@vger.kernel.org Cc: Asit Mallick <asit.k.mallick@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com Link: https://lkml.kernel.org/r/1517522386-18410-3-git-send-email-karahmed@amazon.de
(cherry picked from commit 15d45071523d89b3fb7372e2135fbd72f6af9506)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/cpuid.c
arch/x86/kvm/svm.c
arch/x86/kvm/vmx.c
All the conflicts were contextual. Major differences in the code between UEK4
and upstream (also in UEK4 we only have the feature IBRS, not SPEC_CTRL). We
had to introduce guest_cpuid_has_* functions in cpuid.h for each feature. Also
moved defines in cpuid.h that were needed in cpuid.h and cpuid.c.
Paolo Bonzini [Fri, 15 Jun 2018 09:04:25 +0000 (12:04 +0300)]
KVM: x86: pass host_initiated to functions that read MSRs
SMBASE is only readable from SMM for the VCPU, but it must be always
accessible if userspace is accessing it. Thus, all functions that
read MSRs are changed to accept a struct msr_data; the host_initiated
and index fields are pre-initialized, while the data field is filled
on return.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 609e36d372ad9329269e4a1467bd35311893d1d6)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Paolo Bonzini [Fri, 15 Jun 2018 09:04:24 +0000 (12:04 +0300)]
KVM: VMX: make MSR bitmaps per-VCPU
Place the MSR bitmap in struct loaded_vmcs, and update it in place
every time the x2apic or APICv state can change. This is rare and
the loop can handle 64 MSRs per iteration, in a similar fashion as
nested_vmx_prepare_msr_bitmap.
This prepares for choosing, on a per-VM basis, whether to intercept
the SPEC_CTRL and PRED_CMD MSRs.
Cc: stable@vger.kernel.org # prereq for Spectre mitigation Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry-picked from 904e14fb7cb96401a7dc803ca2863fd5ba32ffe6)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual - different content. Also vmx_enable_intercept_for_msr was already
in UEK4 as part of commit 8d14695f9542e9e0195d6e41ddaa52c32322adf5. We just
changed the signature.
Paolo Bonzini [Fri, 15 Jun 2018 09:04:23 +0000 (12:04 +0300)]
KVM: VMX: introduce alloc_loaded_vmcs
Group together the calls to alloc_vmcs and loaded_vmcs_init. Soon we'll also
allocate an MSR bitmap there.
Cc: stable@vger.kernel.org # prereq for Spectre mitigation Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry-picked from f21f165ef922c2146cc5bdc620f542953c41714b)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kvm/vmx.c
Contextual