Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/include/asm/spec_ctrl.h
arch/x86/kernel/cpu/bugs.c
cpufeatures.h vs cpufeature.h in UEK4
include <linux/jump_label.h> header in spec_ctrl.h to use this feature
bugs.c vs bugs_64.c in UEK4
William Roche [Tue, 19 Feb 2019 14:11:12 +0000 (09:11 -0500)]
int3 handler better address space detection on interrupts
In order to prepare for the possibility of dynamically changing an
interrupt handler's code with static_branch_enable/disable, and since
the interrupt can equally appear while in user space or
kernel space, the int3 handler itself must better identify whether
the original interrupt came from kernel or userland.
Signed-off-by: William Roche <william.roche@oracle.com> Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
(cherry picked from commit 594fc07cd96784004254680c9e1e4b757fb0a1f5)
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S
entry/entry_64.S vs kernel/entry_64.S in UEK4
Mark Nicholson [Tue, 7 May 2019 23:25:59 +0000 (16:25 -0700)]
repairing out-of-tree build functionality
The current uek4 tree (only) cannot build the binrpm-pkg with an objdir
outside the source tree. This fix redirects the incorrectly placed
generated firmware files and firmware parsers into the objtree.
Signed-off-by: Mark Nicholson <mark.j.nicholson@oracle.com> Reviewed-by: Tianyue Lan <tianyue.lan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Shuning Zhang [Wed, 29 May 2019 07:41:35 +0000 (15:41 +0800)]
ext4: fix false negatives *and* false positives in ext4_check_descriptors()
Ext4_check_descriptors() was getting called before s_gdb_count was
initialized. So for file systems w/o the meta_bg feature, allocation
bitmaps could overlap the block group descriptors and ext4 wouldn't
notice.
For file systems with the meta_bg feature enabled, there was a
fencepost error which would cause the ext4_check_descriptors() to
incorrectly believe that the block allocation bitmap overlaps with the
block group descriptor blocks, and it would reject the mount.
Fix both of these problems.
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
(cherry picked from commit 44de022c4382541cebdd6de4465d1f4f465ff1dd) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/super.c
[The context has been changed]
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Shuning Zhang [Wed, 22 May 2019 01:32:40 +0000 (09:32 +0800)]
ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
In some cases, ocfs2_iget() reads the data of an inode which has been
deleted for some reason. That will make the system panic. So we should
check whether this inode has been deleted, and tell the caller that the
inode is a bad inode.
For example, when ocfs2 is used as the backend of NFS and the client is
NFSv3, this issue can be reproduced by the following steps.
on the nfs server side,
..../patha/pathb
Step 1: Process A was scheduled out before calling the function fh_verify.
Step 2: Process B is removing 'pathb', and has just completed the call
to the function dput. The dentry of 'pathb' has then been deleted from the
dcache, and all its ancestors have been deleted as well. The link between
dentry and inode was removed through the function hlist_del_init. The
following is the call stack.
dentry_iput->hlist_del_init(&dentry->d_u.d_alias)
At this time, the inode is still in the dcache.
Step 3: Process A calls the function ocfs2_get_dentry, which gets the
inode from the dcache. Then the refcount of the inode is 1. The following
is the call stack.
nfsd3_proc_getacl->fh_verify->exportfs_decode_fh->fh_to_dentry(ocfs2_get_dentry)
Step 4: Dirty pages are flushed by bdi threads, so the inode of 'patha'
is evicted and this directory is deleted. But the inode of 'pathb'
can't be evicted, because its refcount is 1.
Step 5: Process A keeps running and calls the function
reconnect_path (in exportfs_decode_fh), which calls the function
ocfs2_get_parent of ocfs2. It gets the block number of the parent
directory (patha) by the name '..', then reads the data from disk by
that block number. But this inode has been deleted, so the system panics.
Process A                          Process B
1. in nfsd3_proc_getacl            |
2.                                 | dput
3. fh_to_dentry(ocfs2_get_dentry)  |
4. bdi flush dirty cache           |
5. ocfs2_iget                      |
[283465.542049] OCFS2: ERROR (device sdp): ocfs2_validate_inode_block:
Invalid dinode #580640: OCFS2_VALID_FL not set
[283465.545490] Kernel panic - not syncing: OCFS2: (device sdp): panic forced
after error
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@Oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Marcel Holtmann [Fri, 18 Jan 2019 12:43:19 +0000 (13:43 +0100)]
Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer
The function l2cap_get_conf_opt will return L2CAP_CONF_OPT_SIZE + opt->len
as its length value. The opt->len however is under the control of the
remote user and can be used by an attacker to gain access beyond the
bounds of the actual packet.
To prevent any potential leak of heap memory, it is enough to check that
the resulting len calculation after calling l2cap_get_conf_opt is not
below zero. A well formed packet will always return >= 0 here and will
end with the length value being zero after the last option has been
parsed. In case of malformed packets messing with the opt->len field the
length value will become negative. If that is the case, then just abort
and ignore the option.
In case an attacker uses a too short opt->len value, then garbage will
be parsed, but that is protected by the unknown option handling and also
the option parameter size checks.
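A standalone sketch of the guard described above, with names that only loosely mirror the kernel's L2CAP code (this is an illustration, not the actual patch): each iteration subtracts the header size plus the remote-supplied opt->len from the remaining length, and aborts as soon as the result goes negative.

```c
#include <stdint.h>

/* 2-byte type/length option header, as in L2CAP configuration options. */
#define CONF_OPT_HDR_SIZE 2

/* Returns the number of options parsed, or -1 on a malformed packet
 * whose opt->len overruns the remaining data. */
static int parse_conf_options(const uint8_t *data, int len)
{
    int count = 0;

    while (len >= CONF_OPT_HDR_SIZE) {
        uint8_t opt_len = data[1];          /* attacker-controlled length */

        len -= CONF_OPT_HDR_SIZE + opt_len;
        if (len < 0)                        /* option overruns the packet */
            return -1;                      /* abort and ignore the option */
        data += CONF_OPT_HDR_SIZE + opt_len;
        count++;
    }
    return count;                           /* well-formed: len ended >= 0 */
}
```

A well-formed packet walks to exactly len == 0; only a lying opt->len drives len negative.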
Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Orabug: 29526426
CVE: CVE-2019-3459
(cherry picked from commit 7c9cbd0b5e38a1672fcd137894ace3b042dfbf69) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Marcel Holtmann [Fri, 18 Jan 2019 11:56:20 +0000 (12:56 +0100)]
Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt
When doing option parsing for standard type values of 1, 2 or 4 octets,
the value is converted directly into a variable instead of a pointer. To
avoid being tricked into following a bogus pointer, check that the sizes
actually match for these option types. In L2CAP every option is of fixed
size, and thus it is prudent anyway to ensure that the remote side sends
us the right option size along with the option parameters.
If the option size is not matching the option type, then that option is
silently ignored. It is a protocol violation and instead of trying to
give the remote attacker any further hints just pretend that option is
not present and proceed with the default values. Implementations
following the specification and its qualification procedures will always
use the correct size and are thus not impacted here.
To keep the code readable and consistent across all options, a few
cosmetic changes were also required.
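The size check can be sketched with a hypothetical standalone helper (not the kernel function): accept the 1-, 2- or 4-octet value only when the advertised length equals the size the option type mandates, otherwise fall back to the default as if the option were absent.

```c
#include <stdint.h>

/* 'expected' is the size the option type mandates. A mismatched length
 * is a protocol violation, so the option is silently ignored and the
 * default value is used instead. Little-endian, as on the wire. */
static uint32_t get_opt_val(const uint8_t *val, uint8_t len,
                            uint8_t expected, uint32_t dflt)
{
    if (len != expected)
        return dflt;                        /* pretend option is absent */
    switch (len) {
    case 1:
        return val[0];
    case 2:
        return (uint32_t)val[0] | ((uint32_t)val[1] << 8);
    case 4:
        return (uint32_t)val[0] | ((uint32_t)val[1] << 8) |
               ((uint32_t)val[2] << 16) | ((uint32_t)val[3] << 24);
    default:
        return dflt;
    }
}
```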
Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Orabug: 29526426
CVE: CVE-2019-3459
(cherry picked from commit af3d5d1c87664a4f150fcf3534c6567cb19909b0) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The ring buffer implementation in hid_debug_event() and hid_debug_events_read()
is strange, allowing data to be lost or corrupted. After commit 717adfdaf147
("HID: debug: check length before copy_to_user()") it is possible to enter
an infinite loop in hid_debug_events_read() by providing 0 as count, which
locks up the system. Fix this by rewriting the ring buffer implementation
with kfifo and simplifying the code.
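The kfifo approach can be sketched in userspace as a power-of-two ring buffer with free-running indices, where a read of 0 bytes returns immediately instead of waiting, which is the lockup the patch removes. This is a simplified illustration, not the kernel's kfifo API.

```c
#include <stddef.h>

#define FIFO_SIZE 64            /* must be a power of two */

struct fifo {
    unsigned char buf[FIFO_SIZE];
    unsigned int in, out;       /* free-running indices, wrapped by masking */
};

static size_t fifo_len(const struct fifo *f) { return f->in - f->out; }

static size_t fifo_put(struct fifo *f, const void *src, size_t n)
{
    size_t avail = FIFO_SIZE - fifo_len(f);
    size_t i;

    if (n > avail)
        n = avail;              /* drop what does not fit */
    for (i = 0; i < n; i++)
        f->buf[(f->in + i) & (FIFO_SIZE - 1)] = ((const unsigned char *)src)[i];
    f->in += n;
    return n;
}

static size_t fifo_get(struct fifo *f, void *dst, size_t n)
{
    size_t len = fifo_len(f);
    size_t i;

    if (n == 0)
        return 0;               /* count == 0 must not loop or block */
    if (n > len)
        n = len;
    for (i = 0; i < n; i++)
        ((unsigned char *)dst)[i] = f->buf[(f->out + i) & (FIFO_SIZE - 1)];
    f->out += n;
    return n;
}
```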
This fixes CVE-2019-3819.
v2: fix an execution logic and add a comment
v3: use __set_current_state() instead of set_current_state()
Backport to v4.4: some (tree-wide) patches are missing in v4.4 so
cherry-pick relevant pieces from:
* 6396bb22151 ("treewide: kzalloc() -> kcalloc()")
* a9a08845e9ac ("vfs: do bulk POLL* -> EPOLL* replacement")
* 92529623d242 ("HID: debug: improve hid_debug_event()")
* 174cd4b1e5fb ("sched/headers: Prepare to move signal wakeup & sigpending
methods from <linux/sched.h> into <linux/sched/signal.h>")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187 Cc: stable@vger.kernel.org # v4.18+ Fixes: cd667ce24796 ("HID: use debugfs for events/reports dumping") Fixes: 717adfdaf147 ("HID: debug: check length before copy_to_user()") Signed-off-by: Vladis Dronov <vdronov@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b661fff5f8a0f19824df91cc3905ba2c5b54dc87)
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This change has the following effects, in order of decreasing importance:
1) Prevent a stack buffer overflow
2) Do not append an unnecessary NULL to an already binary buffer, which
was writing one byte past client_digest when the caller is:
chap_string_to_hex(client_digest, chap_r, strlen(chap_r));
The latter was found by KASAN (see below) when the input value has the
expected size (32 hex chars), and further analysis revealed that a stack
buffer overflow can happen when the network-received value is longer,
allowing an unauthenticated remote attacker to smash up to 17 bytes after
the destination buffer (16 bytes attacker-controlled and one null). As
switching to hex2bin requires specifying the destination buffer length and
does not internally append any null, it solves both issues.
This addresses CVE-2018-14633.
Beyond this:
- Validate received value length and check hex2bin accepted the input, to log
this rejection reason instead of just failing authentication.
- Only log received CHAP_R and CHAP_C values once they passed sanity checks.
==================================================================
BUG: KASAN: stack-out-of-bounds in chap_string_to_hex+0x32/0x60 [iscsi_target_mod]
Write of size 1 at addr ffff8801090ef7c8 by task kworker/0:0/1021
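The fix pattern can be sketched as follows, with a userspace stand-in for hex2bin() (decode_hex and its signature are illustrative): the destination length is explicit, input of the wrong length is rejected up front, and no NUL is ever appended past the buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;                  /* non-hex character */
}

/* Decode exactly dstlen bytes from a hex string. Rejects input whose
 * length is not 2*dstlen, and never writes a terminating NUL. */
static int decode_hex(uint8_t *dst, size_t dstlen, const char *src)
{
    size_t i;

    if (strlen(src) != dstlen * 2)
        return -1;              /* too-short or too-long input: reject */
    for (i = 0; i < dstlen; i++) {
        int hi = hex_digit(src[2 * i]);
        int lo = hex_digit(src[2 * i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        dst[i] = (uint8_t)((hi << 4) | lo);
    }
    return 0;                   /* exactly dstlen bytes written */
}
```

Logging the rejection reason at this point, as the commit does, is what distinguishes a malformed value from a plain authentication failure.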
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 755e45f3155cc51e37dc1cce9ccde10b84df7d93)
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jason Yan [Tue, 25 Sep 2018 02:56:54 +0000 (10:56 +0800)]
scsi: libsas: fix a race condition when smp task timeout
When the LLDD is processing the completed sas task in interrupt context
and sets the task state to SAS_TASK_STATE_DONE, the smp timeout timer can
be triggered at the same time, and smp_task_timedout() will complete the
task whether SAS_TASK_STATE_DONE is set or not. The sas task may then be
freed before the LLDD ends its interrupt processing. Thus a use-after-free
will happen.
Fix this by calling complete() only when SAS_TASK_STATE_DONE is not
set, and remove the check of the return value of del_timer(). Once the
LLDD sets DONE, it must call task->done(), which will call
smp_task_done()->complete(), and the task will be completed and freed
correctly.
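The rule the fix enforces reduces to a single predicate (illustrative names, not the kernel code): the timeout path calls complete() only when DONE is not already set, so exactly one path completes and frees the task.

```c
#include <stdbool.h>

#define TASK_STATE_DONE 0x01    /* stand-in for SAS_TASK_STATE_DONE */

/* Timeout-handler side: only complete the task if the LLDD has not
 * already marked it DONE; in that case the task->done() path owns the
 * call to complete(), so completing here would race and free twice. */
static bool timeout_may_complete(unsigned int task_state)
{
    return !(task_state & TASK_STATE_DONE);
}
```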
Reported-by: chenxiang <chenxiang66@hisilicon.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> CC: John Garry <john.garry@huawei.com> CC: Johannes Thumshirn <jthumshirn@suse.de> CC: Ewan Milne <emilne@redhat.com> CC: Christoph Hellwig <hch@lst.de> CC: Tomas Henzl <thenzl@redhat.com> CC: Dan Williams <dan.j.williams@intel.com> CC: Hannes Reinecke <hare@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: John Garry <john.garry@huawei.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b90cd6f2b905905fb42671009dc0e27c310a16ae)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com> Acked-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit bcf3b67d16a4c8ffae0aa79de5853435e683945c)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Young Xiao [Fri, 12 Apr 2019 07:24:30 +0000 (15:24 +0800)]
Bluetooth: hidp: fix buffer overflow
Struct ca is copied from userspace. It is not checked whether the "name"
field is NULL terminated, which allows local users to obtain potentially
sensitive information from kernel stack memory, via a HIDPCONNADD command.
This vulnerability is similar to CVE-2011-1079.
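A minimal sketch of the fix pattern (the struct and field size are illustrative, not the real hidp request layout): after copying the request from userspace, force NUL-termination of the fixed-size name field before it is ever treated as a string.

```c
#include <string.h>

/* Illustrative stand-in for a request struct copied from userspace. */
struct connadd_req {
    char name[128];
};

/* Call right after the copy-from-userspace step: guarantees that later
 * strlen()/strcpy() on 'name' cannot walk past the field and leak
 * adjacent kernel stack memory. */
static void sanitize_req(struct connadd_req *ca)
{
    ca->name[sizeof(ca->name) - 1] = '\0';
}
```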
Signed-off-by: Young Xiao <YangX92@hotmail.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org
(cherry picked from commit a1616a5ac99ede5d605047a9012481ce7ff18b16)
Kanth Ghatraju [Tue, 21 May 2019 20:54:16 +0000 (16:54 -0400)]
x86/speculation/mds: Add 'mitigations=' support for MDS
Add MDS to the new 'mitigations=' cmdline option.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 5c14068f87d04adc73ba3f41c2a303d3c3d1fa12) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
bugs.c equivalent for UEK4 is bugs64.c
Documentation/admin-guide/kernel-parameters.txt is
Documentation/kernel-parameters.txt in UEK4
Orabug: 29791046 Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mao Wenan [Thu, 28 Mar 2019 09:10:56 +0000 (17:10 +0800)]
net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock().
When a net namespace is being cleaned up, rds_tcp_exit_net() will call
rds_tcp_kill_sock(). If t_sock is NULL, it will not call
rds_conn_destroy(), rds_conn_path_destroy() and rds_tcp_conn_free() to free
the connection, and the worker cp_conn_w is not stopped. Afterwards the net
is freed in net_drop_ns(), while the cp_conn_w worker rds_connect_worker()
will call rds_tcp_conn_path_connect() and reference the 'net' which has
already been freed.
In rds_tcp_conn_path_connect(), rds_tcp_set_callbacks() will set t_sock =
sock before sock->ops->connect, but if connect() fails, it will call
rds_tcp_restore_callbacks() and set t_sock = NULL. If connect always
fails, rds_connect_worker() will try to reconnect all the time, so
rds_tcp_kill_sock() will never cancel the worker cp_conn_w and free the
connections.
Therefore, the condition !tc->t_sock is not needed on the
cleanup_net->rds_tcp_exit_net->rds_tcp_kill_sock path, because tc->t_sock
is always NULL there, and there is no other path to cancel cp_conn_w and
free the connection. This patch fixes that.
rds_tcp_kill_sock():
...
if (net != c_net || !tc->t_sock)
...
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
==================================================================
BUG: KASAN: use-after-free in inet_create+0xbcc/0xd28
net/ipv4/af_inet.c:340
Read of size 4 at addr ffff8003496a4684 by task kworker/u8:4/3721
Fixes: 467fa15356ac("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cb66ddd156203daefb8d71158036b27b0e2caf63)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Kanth Ghatraju [Tue, 21 May 2019 21:55:18 +0000 (17:55 -0400)]
x86/speculation/mds: Check for the right microcode before setting mitigation
With the addition of the new mitigation for idle, we need to check the
availability of microcode when mds=idle. The fix is to take the check
out of the if statement.
Orabug: 29797118 Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Dumazet [Sun, 10 Mar 2019 17:36:40 +0000 (10:36 -0700)]
vxlan: test dev->flags & IFF_UP before calling gro_cells_receive()
Same reasons as the ones explained in commit 4179cb5a4c92
("vxlan: test dev->flags & IFF_UP before calling netif_rx()")
netif_rx() or gro_cells_receive() must be called under a strict contract.
At device dismantle phase, core networking clears IFF_UP
and flush_all_backlogs() is called after rcu grace period
to make sure no incoming packet might be in a cpu backlog
and still referencing the device.
A similar protocol is used for gro_cells infrastructure, as
gro_cells_destroy() will be called only after a full rcu
grace period is observed after IFF_UP has been cleared.
Most drivers call netif_rx() from their interrupt handler,
and since the interrupts are disabled at device dismantle,
netif_rx() does not have to check dev->flags & IFF_UP.
Virtual drivers do not have this guarantee, and must
therefore make the check themselves.
Fixes: d342894c5d2f ("vxlan: virtual extensible lan") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 59cbf56fcd98ba2a715b6e97c4e43f773f956393)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
It was applying the patch at vxlan_udp_encap_recv()
instead of vxlan_rcv().
James Smart [Thu, 7 Mar 2019 01:08:12 +0000 (09:08 +0800)]
nvme: allow timed-out ios to retry
Currently nvme_req_needs_retry() applies several checks to see if
a retry is allowed. One of those is whether the current time has exceeded
the start time of the io plus the timeout length. This check means that,
if an io times out, a retry is never allowed for it, so applications see
the io failure.
Remove this check and allow the io to timeout, like it does on other
protocols, and retries to be made.
On the FC transport, a frame can be lost for an individual io, and there
may be no other errors that escalate for the connection/association.
The io will timeout, which causes the transport to escalate into creating
a new association, but the io that timed out, due to this retry logic, has
already failed back to the application and things are hosed.
Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
(backport from upstream commit 0951338d9677f546e230685d68631dfd3f81cca5)
There were two changes to nvme_req_needs_retry before this commit,
but neither has any functional bearing on our issue. To keep the work
simpler, we don't backport them.
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
rds: Introduce a pool of worker threads for connection management
RDS uses a single threaded work queue for connection management. This
involves creation and deletion of QPs and CQs. On certain HCAs, such
as CX-3, these operations are para-virtualized and some part of the
work has to be conducted by the Physical Function (PF) driver.
In fail-over and fail-back situations, there might be 1000s of
connections to tear down and re-establish. Hence, expand the number of
work queues.
The local_wq is removed for simplicity and symmetry reasons.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
---
RDS has two global work queues, one for loop-back connections and
another one for remote connections. The struct rds_conn_path has a
member cp_wq which is set to one of them. Use cp_wq consistently
instead of the global ones.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
RDS attempts to establish if two rdma_cm_ids actually are the
same. This was implemented by comparing their addresses. But, an
rdma_cm_id may be destroyed and allocated again. Thus, a *new* id, but
the address could still be the same as the *old* one.
Solved by using an idr id as the cm_id's context. Added helper
functions to create and destroy rdma_cm_ids and comparing the
rdma_cm_ids.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This commit is reverted for two reasons. Firstly, it doesn't fix the
problem it is supposed to fix. Secondly, it may, in some special
circumstances, create a long-lasting connection reject scenario.
As to the first reason, consider the following scenario during
fail-back. Let's say node A fails back first. It sends a DREQ. Node B
drops the IB connection. Now, both will attempt to connect and both
will perform route resolution. Since both nodes attempt to connect at
the same time, you get a race, and then the lower IP will connect. It
does so and succeeds, because both ends have done route
resolution. Traffic continues to flow. Then, node B fails back and the
connection is torn down again. This is exactly what commit 48c2d5f5e258
("net/rds: prevent RDS connections using stale ARP entries") said it
would prevent.
As to the second reason, the following is an excerpt from the kernel
trace buffer (slightly edited for better brevity):
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.217.20 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 0
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 0
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.216.253 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001887599 tos 0
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001887599 tos 0
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.217.20 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 2
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 2
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.216.253 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001887599 tos 0
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001887599 tos 0
rds_ib_cm_handle_connect: 1033: saddr ::ffff:192.168.217.20 daddr ::ffff:192.168.216.252 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 4
rds_ib_cm_handle_connect: 1077: no route resolution saddr 0.0.0.0 daddr 0.0.0.0 RDSv4.1 lguid 0x10e00001888efa fguid 0x10e00001778a52 tos 4
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
* net/rds/ib_cm.c
* net/rds/rdma_transport.c
The nature of the conflicts was ftrace points that had been
IPv6-ified.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Dag Moxnes <dag.moxnes@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Håkon Bugge [Thu, 7 Mar 2019 11:36:44 +0000 (12:36 +0100)]
rds: ib: Flush ARP cache when needed
During Active/Active fail-over and fail-back, the ARP cache may
contain stale entries. Hence flush the foreign address from the ARP
cache when the following events are received:
Suggested-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Dag Moxnes <dag.moxnes@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Håkon Bugge [Thu, 2 May 2019 12:31:40 +0000 (14:31 +0200)]
rds: Add simple heuristics to determine connect delay
Introduce heuristics to get an estimate of how long it takes to form a
connection. We use this estimate as a factor to delay the passive
side, in case the active is unaware of the connection being down.
The delay should be as short as possible, to avoid extended
connection times in case the remote peer is unaware of the connection
being brought down. At the same time, the delay should be long enough
for the remote active peer to initiate a connect, in case it is
aware of the connection being brought down.
A sysctl variable rds_sysctl_passive_connect_delay_percent is
introduced to have the ability to tune the delay of the allegedly
passive side.
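The heuristic reduces to scaling the connect-time estimate by the sysctl percentage. A sketch under the assumption that the delay is simply estimate * percent / 100 (the exact formula is not spelled out above; the helper name is hypothetical):

```c
/* est_ms: running estimate of how long connection setup takes.
 * percent: tunable in the spirit of
 *          rds_sysctl_passive_connect_delay_percent.
 * Returns how long the allegedly passive side should hold back before
 * initiating its own connect. */
static unsigned int passive_connect_delay_ms(unsigned int est_ms,
                                             unsigned int percent)
{
    return est_ms * percent / 100;
}
```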
Håkon Bugge [Thu, 7 Mar 2019 11:39:22 +0000 (12:39 +0100)]
rds: Fix one-sided connect
The decision to designate a peer to be the active side did not take
loopback connections into account. Further, due to a bug in
rds_shutdown_worker, the passive side, in case no reconnect was
racing, did not attempt to restart the connection.
Håkon Bugge [Thu, 7 Mar 2019 11:32:46 +0000 (12:32 +0100)]
rds: ib: Fix gratuitous ARP storm
When the rds Active/Active bonding moves an address to another port,
it informs its peer by sending out 100 gratuitous ARPs (gARPs)
back-to-back.
The gARPs are broadcasts, so this mechanism may flood the fabric with
gARPs. These broadcasts may be dropped, and since the 100 gARPs for
one address are sent consecutively, all the gARPs for a particular
address may be lost.
Fix this by sending far fewer gARPs (3) and adding some interval between
them (5 msec).
The module parameter rds_ib_active_bonding_arps_gap_ms has been
introduced. Both the existing rds_ib_active_bonding_arps and
rds_ib_active_bonding_arps_gap_ms are now writable.
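The revised scheme amounts to a short, spaced-out send schedule rather than a burst. A sketch computing the send offsets (the function is illustrative; its parameters echo the module parameters named above):

```c
/* Fill 'offsets_ms' with the time offset of each gARP transmission:
 * 3 ARPs with a 5 ms gap yields 0, 5, 10 instead of a back-to-back
 * burst of 100, so a transient drop no longer loses every gARP for
 * an address. */
static int garp_schedule(unsigned int arps, unsigned int gap_ms,
                         unsigned int *offsets_ms, unsigned int max)
{
    unsigned int i;

    if (arps > max)
        return -1;              /* caller's buffer too small */
    for (i = 0; i < arps; i++)
        offsets_ms[i] = i * gap_ms;
    return (int)arps;
}
```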
The default values were suggested by Ka-Cheong Poon <ka-cheong.poon@oracle.com>.
Orabug: 29391909
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Tested-by: Rosa Lopez <rosa.lopez@oracle.com> Reviewed-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Håkon Bugge [Sun, 17 Feb 2019 14:45:12 +0000 (15:45 +0100)]
IB/mlx4: Increase the timeout for CM cache
Using CX-3 virtual functions, either from a bare-metal machine or
pass-through from a VM, MAD packets are proxied through the PF driver.
Since the VF drivers have separate name spaces for MAD Transaction Ids
(TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
in a cache.
Following the RDMA Connection Manager (CM) protocol, it is clear when
an entry has to be evicted from the cache. But life is not perfect:
remote peers may die or be rebooted. Hence, there is a timeout for
wiping out a cache entry, after which the PF driver assumes the remote
peer has gone.
During workloads where a high number of QPs are destroyed concurrently,
an excessive number of CM DREQ retries has been observed.
The problem can be demonstrated in a bare-metal environment, where two
nodes have instantiated 8 VFs each, using dual-ported HCAs, so we
have 16 vPorts per physical server.
64 processes are associated with each vPort, and each creates and
destroys one QP for each of the remote 64 processes. That is, 1024 QPs
per vPort, 16K QPs in all. The QPs are created/destroyed using the
CM.
When tearing down these 16K QPs, excessive CM DREQ retries (and
duplicates) are observed. With some cat/paste/awk wizardry on the
infiniband_cm sysfs, we observe the following, summed over the 16 vPorts
on one of the nodes:
Note that the active/passive side is equally distributed between the
two nodes.
Enabling pr_debug in cm.c gives tons of:
[171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
1,sl_cm_id: 0xd393089f} is NULL!
By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
tear-down phase of the application is reduced from approximately 90 to
50 seconds. Retries/duplicates are also significantly reduced:
Increasing the timeout further didn't help, as these duplicates and
retries stem from a too short CMA timeout, which was 20 (~4 seconds)
on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
numbers fell down to about 10 for both of them.
Adjustment of the CMA timeout is not part of this commit.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
(cherry picked from upstream commit 2612d723aadcf8281f9bf8305657129bd9f3cd57)
Alejandro Jimenez [Wed, 20 Mar 2019 16:55:38 +0000 (12:55 -0400)]
kvm/speculation: Allow KVM guests to use SSBD even if host does not
The bits set in x86_spec_ctrl_mask are used to determine the
allowed value that is written to SPEC_CTRL MSR before VMENTRY,
and controls which mitigations the guest can enable. In the
case of SSBD, unless the host has enabled SSBD always on
(which sets SSBD bit on x86_spec_ctrl_mask), the guest is
unable to use the SSBD mitigation. This was confirmed by
running the SSBD PoC and verifying that guests are always
vulnerable regardless of their own SSBD setting, unless
the host has booted with "spec_store_bypass_disable=on".
Set the SSBD bit in x86_spec_ctrl_mask when the host
CPU supports it, whether or not the host has chosen to
enable the mitigation in any of its modes.
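The effect can be sketched as a simple bitmask gate (bit positions per the Intel SDM; guest_spec_ctrl is an illustrative helper, not the kernel function): only bits present in x86_spec_ctrl_mask survive into the value written before VMENTRY, so adding SSBD to the mask is what lets the guest's own setting through.

```c
#include <stdint.h>

#define SPEC_CTRL_IBRS (1ULL << 0)  /* IA32_SPEC_CTRL bit 0 */
#define SPEC_CTRL_SSBD (1ULL << 2)  /* IA32_SPEC_CTRL bit 2 */

/* Gate the guest-requested SPEC_CTRL bits through the host mask: a bit
 * the host never put in the mask can never reach the MSR at VMENTRY. */
static uint64_t guest_spec_ctrl(uint64_t guest_wants, uint64_t mask)
{
    return guest_wants & mask;
}
```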
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alejandro Jimenez [Wed, 20 Mar 2019 16:49:58 +0000 (12:49 -0400)]
x86/speculation: Keep enhanced IBRS on when spec_store_bypass_disable=on is used
When SSBD is unconditionally enabled using the kernel parameter
"spec_store_bypass_disable=on", enhanced IBRS is inadvertently turned
off. This happens because the SSBD initialization runs after the code
which selects enhanced IBRS as the spectre V2 mitigation and sets the
IBRS bit on the SPEC_CTRL MSR.
When "spec_store_bypass_disable=on" is used, ssb_init() calls
x86_spec_ctrl_set(SPEC_CTRL_INITIAL), which writes to the SPEC_CTRL
MSR to set the SSBD bit. The value written does not have the IBRS bit
set, since if basic IBRS is in use it will be set during the next
userspace to kernel transition. However, this is not the case for
enhanced IBRS where setting the bit once is sufficient. As a result,
enhanced IBRS remains disabled in this scenario unless manually
enabled afterwards using the sysfs knobs.
Fix the issue by using the correct value with the IBRS bit set when
the enhanced IBRS mitigation is in use.
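A sketch of the corrected value composition (bit definitions per the SDM; the helper is illustrative, not the kernel's x86_spec_ctrl_set): when enhanced IBRS is the selected spectre V2 mitigation, the value written for ssbd=on must keep the IBRS bit, since with enhanced IBRS that bit is written once and expected to stay set.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPEC_CTRL_IBRS (1ULL << 0)  /* IA32_SPEC_CTRL bit 0 */
#define SPEC_CTRL_SSBD (1ULL << 2)  /* IA32_SPEC_CTRL bit 2 */

/* Value to write for spec_store_bypass_disable=on: include IBRS when
 * enhanced IBRS is in use, so the one-time IBRS write is not undone
 * by the SSBD initialization running later. */
static uint64_t ssbd_on_spec_ctrl(bool enhanced_ibrs)
{
    uint64_t val = SPEC_CTRL_SSBD;

    if (enhanced_ibrs)
        val |= SPEC_CTRL_IBRS;
    return val;
}
```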
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alejandro Jimenez [Wed, 20 Mar 2019 16:40:58 +0000 (12:40 -0400)]
x86/speculation: Clean up enhanced IBRS checks in bugs_64.c
There are multiple instances in bugs_64.c where the initialization
code for the various mitigations must check if enhanced IBRS is
selected as the spectre V2 mitigation. Create a short function
to do just that.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Andrea Arcangeli [Fri, 2 Nov 2018 22:47:59 +0000 (15:47 -0700)]
mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings
THP allocation might be really disruptive when allocated on a NUMA system
with the local node full or hard to reclaim. Stefan has posted an
allocation stall report on a 4.12-based SLES kernel which suggests the
same issue:
The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for a MADV_HUGEPAGE vma.
Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:
: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).
Fix this by removing __GFP_THISNODE for THP requests which are
requesting direct reclaim. This effectively reverts 5265047ac301
on the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free. While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases. The existing behaviour
is similar, if not as widespread, as it applies to a corner case, but
crucially, it cannot be tuned around like zone_reclaim_mode can. The
default behaviour should always be to cause the least harm for the
common case.
If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top. Longterm we should
consider a memory policy which allows for node-reclaim-like behavior
for specific memory ranges.
: Both patches look correct to me but I'm responding to this one because
: it's the fix. The change makes sense and moves further away from the
: severe stalling behaviour we used to see with both THP and zone reclaim
: mode.
:
: I put together a basic experiment with usemem configured to reference a
: buffer multiple times that is 80% the size of main memory on a 2-socket
: box with symmetric node sizes and defrag set to "always". The defrag
: setting is not the default but it would be functionally similar to
: accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to
: reference the buffer multiple times and while it's not an interesting
: workload, it would be expected to complete reasonably quickly as it fits
: within memory. The results were;
:
: usemem
: vanilla noreclaim-v1
: Amean Elapsd-1 42.78 ( 0.00%) 26.87 ( 37.18%)
: Amean Elapsd-3 27.55 ( 0.00%) 7.44 ( 73.00%)
: Amean Elapsd-4 5.72 ( 0.00%) 5.69 ( 0.45%)
:
: This shows the elapsed time in seconds for 1 thread, 3 threads and 4
: threads referencing buffers 80% the size of memory. With the patches
: applied, it's 37.18% faster for the single thread and 73% faster with two
: threads. Note that 4 threads showing little difference does not indicate
: the problem is related to thread counts. It's simply the case that 4
: threads gets spread so their workload mostly fits in one node.
:
: The overall view from /proc/vmstats is more startling
:
: 4.19.0-rc1 4.19.0-rc1
: vanillanoreclaim-v1r1
: Minor Faults 35593425 708164
: Major Faults 484088 36
: Swap Ins 3772837 0
: Swap Outs 3932295 0
:
: Massive amounts of swap in/out without the patch
:
: Direct pages scanned 6013214 0
: Kswapd pages scanned 0 0
: Kswapd pages reclaimed 0 0
: Direct pages reclaimed 4033009 0
:
: Lots of reclaim activity without the patch
:
: Kswapd efficiency 100% 100%
: Kswapd velocity 0.000 0.000
: Direct efficiency 67% 100%
: Direct velocity 11191.956 0.000
:
: Mostly from direct reclaim context as you'd expect without the patch.
:
: Page writes by reclaim 3932314.000 0.000
: Page writes file 19 0
: Page writes anon 3932295 0
: Page reclaim immediate 42336 0
:
: Writes from reclaim context is never good but the patch eliminates it.
:
: We should never have default behaviour to thrash the system for such a
: basic workload. If zone reclaim mode behaviour is ever desired but on a
: single task instead of a global basis then the sensible option is to build
: a mempolicy that enforces that behaviour.
This was a severe regression compared to previous kernels that made
important workloads unusable and it starts when __GFP_THISNODE was
added to THP allocations under MADV_HUGEPAGE. It is not a significant
risk to go to the previous behavior before __GFP_THISNODE was added, it
worked like that for years.
This was simply an optimization to some lucky workloads that can fit in
a single node, but it ended up breaking the VM for others that can't
possibly fit in a single node, so going back is safe.
[mhocko@suse.com: rewrote the changelog based on the one from Andrea] Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Stefan Priebe <s.priebe@profihost.ag> Debugged-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Mel Gorman <mgorman@techsingularity.net> Tested-by: Mel Gorman <mgorman@techsingularity.net> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ac5b2c18911ffe95c08d69273917f90212cf5659) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Larry Bassel <larry.bassel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mempolicy.c
[__GFP_DIRECT_RECLAIM does not exist in UEK4, use __GFP_WAIT]
Michael Chan [Mon, 8 Apr 2019 21:39:55 +0000 (17:39 -0400)]
bnxt_en: Reset device on RX buffer errors.
If the RX completion indicates RX buffer errors, the RX ring will be
disabled by firmware and no packets will be received on that ring from
that point on. Recover by resetting the device.
Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8e44e96c6c8e8fb80b84a2ca11798a8554f710f2)
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Boris Ostrovsky [Mon, 13 May 2019 22:29:42 +0000 (18:29 -0400)]
x86/mitigations: Fix the test for Xen PV guest
Commit 6af1c37c19ea ("x86/pti: Don't report XenPV as vulnerable")
looks at the current hypervisor to determine whether we are running as a Xen
PV guest. This is incorrect since the test will be true for HVM guests
as well.
Instead, we should check xen_pv_domain().
(Using Xen-specific primitives in this file is not ideal. This is not
for upstream though so we are going to have to live with this)
Fixes: 6af1c37c19ea ("x86/pti: Don't report XenPV as vulnerable") Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Kanth Ghatraju [Thu, 16 May 2019 19:54:40 +0000 (15:54 -0400)]
x86/speculation/mds: Fix verw usage to use memory operand
The VERW instruction needs to be called with a memory operand instead
of a register operand to correctly flush the buffers affected by
MDS. The buffer overwriting occurs regardless of the permission check,
and works even with the null selector.
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hannes Reinecke [Thu, 13 Oct 2016 13:10:41 +0000 (15:10 +0200)]
scsi: libfc: sanitize E_D_TOV and R_A_TOV setting
When setting the FCP timeout we need to ensure a lower boundary
for E_D_TOV and R_A_TOV; otherwise we'd get spurious I/O
issues due to the FCP timer firing too early.
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hannes Reinecke [Fri, 30 Sep 2016 09:01:19 +0000 (11:01 +0200)]
scsi: libfc: don't advance state machine for incoming FLOGI
When we receive an FLOGI but have already sent our own we should
not advance the state machine but rather wait for our FLOGI to
return before continuing with PLOGI.
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hannes Reinecke [Fri, 30 Sep 2016 09:01:17 +0000 (11:01 +0200)]
scsi: libfc: Do not drop down to FLOGI for fc_rport_login()
When fc_rport_login() is called while the rport is not
in RPORT_ST_INIT, RPORT_ST_READY, or RPORT_ST_DELETE
login is already in progress and there's no need to
drop down to FLOGI; doing so will only confuse the
other side.
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Chad Dupuis [Fri, 30 Sep 2016 09:01:16 +0000 (11:01 +0200)]
scsi: libfc: Do not take rdata->rp_mutex when processing a -FC_EX_CLOSED ELS response.
When an ELS response handler receives a -FC_EX_CLOSED, the rdata->rp_mutex is
already held which can lead to a deadlock condition like the following stack trace:
The other ELS handlers need to follow the FLOGI response handler and simply do
a kref_put against the fc_rport_priv struct and exit when receiving a
-FC_EX_CLOSED response.
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Chad Dupuis <chad.dupuis@cavium.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hannes Reinecke [Fri, 30 Sep 2016 09:01:15 +0000 (11:01 +0200)]
scsi: libfc: Fixup disc_mutex handling
The list of attached 'rdata' remote port structures is RCU
protected, so there is no need to take the 'disc_mutex' when
traversing it.
Rather we should be using rcu_read_lock() and kref_get_unless_zero()
to validate the entries.
We do, however, need to take the disc_mutex when deleting an entry;
otherwise we risk clashing with list_add().
Reviewed-by: John Sobecki <john.sobecki@oracle.com> Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ajaykumar Hotchandani [Wed, 27 Feb 2019 22:53:57 +0000 (14:53 -0800)]
xve: arm ud tx cq to generate completion interrupts
IPoIB polls the UD send CQ on every 16th post_send() request to reduce
the interrupt count, and it does not arm the UD send CQ (the value 16 is
controlled by the MAX_SEND_CQE variable).
XVE followed the IPoIB methodology in its handling of the UD send CQ;
however, it failed to poll the send CQ after a certain number of iterations.
This makes freeing of the resources related to work requests unreliable,
since completion arrival is not controlled. This caused a problem for
live migration: the initial UDP and ICMP skbs which use UD work
requests were not getting freed, and the xenwatch process got stuck
waiting for these skbs to be freed.
This patch does following:
- arm send cq at initialization. This will generate interrupt for
initial ud send requests.
- Once polling of send cq is completed, arm send cq again to generate
interrupt whenever next cqe arrives.
I'm going back to the interrupt mechanism, since the UD workload for xve
is extremely limited, and I don't expect an interrupt flood here.
I also don't want to miss freeing an skb (for example, if only 10
post_send() calls are attempted on a UD QP and we then try to live
migrate that VM, we may miss a completion if our logic is to poll the
CQ only at every 16th post_send() iteration).
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com> Reviewed-by: Chien Yen <chien.yen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alexei Starovoitov [Fri, 1 May 2015 03:14:07 +0000 (20:14 -0700)]
net: sched: run ingress qdisc without locks
TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow on patches.
This is the last patch from that series that finally drops
ingress spin_lock.
Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.
In two cpu case when both cores are receiving traffic on the same
device and go into the same ingress+u32 the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps
Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 087c1a601ad7f851a2d31f5fa0e5e9dfc766df55)
Orabug: 29395374 Signed-off-by: Calum Mackay <calum.mackay@oracle.com> Reviewed-by: Laurence Rochfort <laurence.rochfort@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The logic that polls for the firmware message response uses a shorter
sleep interval for the first few passes. But due to a typo, it was
using the wrong (larger) counter for these short sleep
passes. The result is a slightly shorter timeout period for these
firmware messages than intended. Fix it by using the proper counter.
Fixes: 9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com>
The code waits up to 20 usec for the firmware response to complete
once we've seen the valid response header in the buffer. It turns
out that in some scenarios, this wait time is not long enough.
Extend it to 150 usec and use usleep_range() instead of udelay().
Fixes: 9751e8e71487 ("bnxt_en: reduce timeout on initial HWRM calls") Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: Allen Pais <allen.pais@oracle.com>
Syzbot caught an oops at unregister_shrinker() because the combination of
commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
injection made register_shrinker() fail and the caller of
register_shrinker() did not check for failure.
Since allowing register_shrinker() callers to call unregister_shrinker()
when register_shrinker() failed can simplify the error recovery path, this
patch makes unregister_shrinker() a no-op when register_shrinker() failed.
Also, reset shrinker->nr_deferred in case unregister_shrinker() is
erroneously called twice.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Glauber Costa <glauber@scylladb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7880fc541566166d140954825fc83c826534e622)
Signed-off-by: John Sobecki <john.sobecki@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
David Howells [Wed, 24 Feb 2016 14:37:54 +0000 (14:37 +0000)]
X.509: Handle midnight alternative notation in GeneralizedTime
The ASN.1 GeneralizedTime object carries an ISO 8601 format date and time.
The time is permitted to show midnight as 00:00 or 24:00 (the latter being
equivalent of 00:00 of the following day).
The permitted value is checked in x509_decode_time() but the actual
handling is left to mktime64().
Without this patch, certain X.509 certificates will be rejected, which
could lead to an unbootable kernel.
Note that with this patch we also permit any 24:mm:ss time and extend
this to UTCTime, which, whilst not strictly correct, doesn't permit much
leeway for fiddling date strings.
Reported-by: Rudolf Polzer <rpolzer@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
cc: David Woodhouse <David.Woodhouse@intel.com>
cc: John Stultz <john.stultz@linaro.org>
Orabug: 29460344
CVE: CVE-2015-5327
(cherry picked from commit 7650cb80e4e90b0fae7854b6008a46d24360515f) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
David Howells [Wed, 24 Feb 2016 14:37:53 +0000 (14:37 +0000)]
X.509: Support leap seconds
The format of ASN.1 GeneralizedTime seems to be specified by ISO 8601
[X.680 46.3] and this apparently supports leap seconds (ie. the seconds
field is 60). It's not entirely clear that ASN.1 expects it, but we can
relax the seconds check slightly for GeneralizedTime.
This results in us passing a time with sec as 60 to mktime64(), which
handles it as being a duplicate of the 0th second of the next minute.
We can't really do otherwise without giving the kernel much greater
knowledge of where all the leap seconds are. Unfortunately, this would
require changing the mapping of the kernel's current-time-in-seconds.
UTCTime, however, only supports a seconds value in the range 00-59, but
for the sake of simplicity we allow this with UTCTime also.
Without this patch, certain X.509 certificates will be rejected,
potentially making a kernel unbootable.
Reported-by: Rudolf Polzer <rpolzer@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
cc: David Woodhouse <David.Woodhouse@intel.com>
cc: John Stultz <john.stultz@linaro.org>
Orabug: 29460344
CVE: CVE-2015-5327
(cherry picked from commit da02559c9f864c8d62f524c1e0b64173711a16ab) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
David Howells [Thu, 12 Nov 2015 09:36:40 +0000 (09:36 +0000)]
X.509: Fix the time validation [ver #2]
This fixes CVE-2015-5327. It affects kernels from 4.3-rc1 onwards.
Fix the X.509 time validation to use month number-1 when looking up the
number of days in that month. Also put the month number validation before
doing the lookup so as not to risk overrunning the array.
Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
is_broadcast_packet() expands to compare_ether_addr() which doesn't
exist since commit 7367d0b573d1 ("drivers/net: Convert uses of
compare_ether_addr to ether_addr_equal"). It turns out it's actually not
used.
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The be2net implementation of .ndo_tunnel_{add,del}() changes the value of
NETIF_F_GSO_UDP_TUNNEL bit in 'features' and 'hw_features', but it forgets
to call netdev_features_change(). Moreover, ethtool setting for that bit
can potentially be reverted after a tunnel is added or removed.
GSO already does software segmentation when 'hw_enc_features' is 0, even
if VXLAN offload is turned on. In addition, commit 096de2f83ebc ("benet:
stricter vxlan offloading check in be_features_check") avoids hardware
segmentation of non-VXLAN tunneled packets, or VXLAN packets having wrong
destination port. So, it's safe to avoid flipping the above feature on
addition/deletion of VXLAN tunnels.
Fixes: 630f4b70567f ("be2net: Export tunnel offloads only when a VxLAN tunnel is created") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
DMA allocated memory is lost in be_cmd_get_profile_config() when we
call it with non-NULL port_res parameter.
Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Enabling only Skyhawk support reduces the module size by ~20kb.
Also convert the Kconfig help text to the new style.
Reviewed-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: Petr Oros <poros@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.
Addresses-Coverity-ID: 114787 ("Missing break in switch") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Trivial fix to spelling mistake in dev_info message.
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
1) This patch gathers and prints the following info that can
help in diagnosing the cause of a TX-timeout.
a) TX queue and completion queue entries.
b) SKB and TCP/UDP header details.
2) For Lancer NICs (TX-timeout recovery is not supported for
BE3/Skyhawk-R NICs), it recovers from the TX timeout as follows:
a) On a TX-timeout, driver sets the PHYSDEV_CONTROL_FW_RESET_MASK
bit in the PHYSDEV_CONTROL register. Lancer firmware goes into
an error state and indicates this back to the driver via a bit
in a doorbell register.
b) Driver detects this and calls be_err_recover(). DMA is disabled,
all pending TX skbs are unmapped and freed (be_close()). All rings
are destroyed (be_clear()).
c) The driver waits for the FW to re-initialize and re-creates all
rings along with other data structs (be_resume())
Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The current position of the .rss_flags field in struct rss_info causes
the fields .rsstable and .rssqueue (both 128 bytes long) to cross
cache-line boundaries. Moving it to the end properly aligns all fields.
Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
- Unionize two u8 fields where only one of them is used, depending on the
NIC chipset.
- Move the recovery_supported field after that union
These changes eliminate a 7-byte hole in the struct and make it smaller
by 8 bytes.
Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Re-order fields in struct be_eq_obj to ensure that the .napi field begins
at the start of a cache-line. Also move the .adapter field to the first
cache-line, next to the .q field, and move 3 fields (idx, msi_idx,
spurious_intr) along with the 4-byte hole to the 3rd cache-line.
Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The commit fb6113e688e0 ("be2net: get rid of custom busy poll code")
replaced custom busy-poll code by the generic one but left several
macros and fields in struct be_eq_obj that are currently unused.
Remove this stuff.
Fixes: fb6113e688e0 ("be2net: get rid of custom busy poll code") Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The commit 2632bafd74ae ("be2net: fix adaptive interrupt coalescing")
introduced a separate struct be_aic_obj to hold AIC information but
unfortunately left the old stuff in be_eq_obj. So remove it.
Fixes: 2632bafd74ae ("be2net: fix adaptive interrupt coalescing") Signed-off-by: Ivan Vecera <cera@cera.cz> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Check for 0xE00 (RECOVERABLE_ERR) along with ARMFW UE (0x0)
in be_detect_error() to determine whether the error is a valid error.
Fixes: 673c96e5a ("be2net: Fix UE detection logic for BE3") Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Martin K. Petersen [Thu, 28 Sep 2017 01:38:59 +0000 (21:38 -0400)]
scsi: sd: Do not override max_sectors_kb sysfs setting
A user may lower the max_sectors_kb setting in sysfs to accommodate
certain workloads. Previously we would always set the max I/O size to
either the block layer default or the optional preferred I/O size
reported by the device.
Keep the current heuristics for the initial setting of max_sectors_kb.
For subsequent invocations, only update the current queue limit if it
exceeds the capabilities of the hardware.
Cc: <stable@vger.kernel.org> Reported-by: Don Brace <don.brace@microsemi.com> Reviewed-by: Martin Wilck <mwilck@suse.com> Tested-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 77082ca503bed061f7fbda7cfd7c93beda967a41)
Signed-off-by: John Sobecki <john.sobecki@oracle.com> Tested-by: Dustin Samko <Dustin.Samko@gm.com> Reviewed-by: Ritika Srivastava <ritika.srivastava@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
There have been reports of oversize UDP packets being sent to the
driver to be transmitted, causing error conditions. The issue is
likely caused by the dst of the SKB switching between 'lo' with
64K MTU and the hardware device with a smaller MTU. Patches are
being proposed by Mahesh Bandewar <maheshb@google.com> to fix the
issue.
In the meantime, add a quick length check in the driver to prevent
the error. The driver uses the TX packet size as index to look up an
array to setup the TX BD. The array is large enough to support all MTU
sizes supported by the driver. The oversize TX packet causes the
driver to index beyond the array and put garbage values into the
TX BD. Add a simple check to prevent this.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2b3c6885386020b1b9d92d45e8349637e27d1f66) Signed-off-by: Brian Maly <brian.maly@oracle.com>
x86/speculation: Read per-cpu value of x86_spec_ctrl_priv in x86_virt_spec_ctrl()
In x86_virt_spec_ctrl(), when IBRS is in use on the host, the baseline to
restore the host SPEC_CTRL must be taken from the privileged value which
has the IBRS bit set. In addition, it must be read from the per-cpu variable
(x86_spec_ctrl_priv_cpu) that holds the SPEC_CTRL MSR for the current cpu.
Currently, this line:
hostval = this_cpu_read(x86_spec_ctrl_priv);
incorrectly uses the global x86_spec_ctrl_priv instead of the correct
per-cpu variable x86_spec_ctrl_priv_cpu, which assigns spurious values
to hostval.
Fix this issue by reading the correct per-cpu value instead of the
global.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alejandro Jimenez [Wed, 20 Mar 2019 15:00:49 +0000 (11:00 -0400)]
x86/speculation: Keep enhanced IBRS on when prctl is used for SSBD control
When using the prctl system call to enable/disable SSBD mitigation for a
specific thread, it is necessary to update the SPEC_CTRL MSR on the CPU
running it. The value used as the base for the MSR that will be written
is x86_spec_ctrl_base, which does not have the IBRS bit set. The relevant
SSBD bits are OR'd to this value before it is written to the MSR, but
the IBRS bit will remain unset.
As a result, the thread that requested the SSBD protection will run without
IBRS enabled, and when it is context switched out, IBRS will not be turned
back on again. This is not a problem in processors that use basic IBRS since
the bit is constantly toggled on kernel entry, but with enhanced IBRS this
is not necessary and therefore the bit remains unset.
Fix it by adding a check to detect when enhanced IBRS is in use, and add
the bit to the MSR value that will be used as the baseline.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hui Peng [Wed, 12 Dec 2018 11:42:24 +0000 (12:42 +0100)]
USB: hso: Fix OOB memory access in hso_probe/hso_get_config_data
The function hso_probe reads if_num from the USB device (as a u8) and uses
it without a length check to index an array, resulting in an OOB memory read
in hso_probe or hso_get_config_data.
Add a length check in both locations and update hso_probe to bail out on
error.
This issue has been assigned CVE-2018-19985.
Reported-by: Hui Peng <benquike@gmail.com> Reported-by: Mathias Payer <mathias.payer@nebelwelt.net> Signed-off-by: Hui Peng <benquike@gmail.com> Signed-off-by: Mathias Payer <mathias.payer@nebelwelt.net> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5146f95df782b0ac61abde36567e718692725c89)
swiotlb: save io_tlb_used to local variable before leaving critical section
When swiotlb is full, the kernel would print io_tlb_used. However, the
result might be inaccurate at that time because we have left the critical
section protected by spinlock.
Therefore, we backup the io_tlb_used into local variable before leaving
critical section.
Fixes: 83ca25948940 ("swiotlb: dump used and total slots when swiotlb buffer is full") Suggested-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 29637525
(cherry picked from commit 53b29c336830db48ad3dc737f88b8c065b1f0851) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
kernel/dma/swiotlb.c does not exist.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-By: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
swiotlb: dump used and total slots when swiotlb buffer is full
So far the kernel only prints the requested size when the swiotlb buffer
is full. It is not possible to know whether the buffer is simply
exhausted, or whether swiotlb cannot allocate a buffer of the requested
size due to fragmentation.
As 'io_tlb_used' is available since commit 71602fe6d4e9 ("swiotlb: add
debugfs to track swiotlb buffer usage"), both 'io_tlb_used' and
'io_tlb_nslabs' are printed when swiotlb buffer is full.
(cherry picked from commit 83ca259489409a1fe8a83dad83a82f32174d4f31) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
kernel/dma/swiotlb.c does not exist.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-By: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
x86/bugs, kvm: don't miss SSBD when IBRS is in use.
When IBRS is in use, we unconditionally need to write to MSR_IA32_SPEC_CTRL
(it acts as a barrier) but we were failing to take into account the SSBD
state from the thread info flags, potentially disabling SSBD on the host on
tasks that need it after a vmexit.
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Shuning Zhang [Mon, 15 Apr 2019 15:01:37 +0000 (23:01 +0800)]
cifs: Fix use after free of a mid_q_entry
With protocol version 2.0 mounts we have seen crashes with corrupt mid
entries. Either the server->pending_mid_q list becomes corrupt with a
cyclic reference in one element or a mid object fetched by the
demultiplexer thread becomes overwritten during use.
Code review identified a race between the demultiplexer thread and the
request issuing thread. The demultiplexer thread seems to be written
with the assumption that it is the sole user of the mid object until
it calls the mid callback which either wakes the issuer task or
deletes the mid.
This assumption is not true because the issuer task can be woken up
earlier by a signal. If the demultiplexer thread has proceeded as far
as setting the mid_state to MID_RESPONSE_RECEIVED then the issuer
thread will happily end up calling cifs_delete_mid while the
demultiplexer thread still is using the mid object.
Inserting a delay in the cifs demultiplexer thread widens the race
window and makes reproduction of the race very easy:
if (server->large_buf)
buf = server->bigbuf;
+ usleep_range(500, 4000);
server->lstrp = jiffies;
To resolve this I think the proper solution involves putting a
reference count on the mid object. This patch makes sure that the
demultiplexer thread holds a reference until it has finished
processing the transaction.
Cc: stable@vger.kernel.org Signed-off-by: Lars Persson <larper@axis.com> Acked-by: Paulo Alcantara <palcantara@suse.de> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
(cherry picked from commit 696e420bb2a6624478105651d5368d45b502b324)
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/cifs/connect.c
fs/cifs/smb2ops.c
fs/cifs/smb2transport.c
fs/cifs/transport.c
[
connect.c: context has changed.
smb2ops.c: hdr->Command changed to shdr->Command in the third line.
smb2transport.c: context has changed; the code was the statement
block of an else, but that block has been moved outside.
transport.c: context has changed; the code was the statement
block of an else, but that block has been moved outside.
]
Linus Torvalds [Mon, 22 Aug 2016 23:41:46 +0000 (16:41 -0700)]
binfmt_elf: switch to new creds when switching to new mm
We used to delay switching to the new credentials until after we had
mapped the executable (and possible elf interpreter). That was kind of
odd to begin with, since the new executable will actually then _run_
with the new creds, but whatever.
The bigger problem was that we also want to make sure that we turn off
prof events and tracing before we start mapping the new executable
state. So while this is a cleanup, it's also a fix for a possible
information leak.
Reported-by: Robert Święcki <robert@swiecki.net> Tested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: David Howells <dhowells@redhat.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 9f834ec18defc369d73ccf9e87a2790bfa05bf46)
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Boris Ostrovsky [Thu, 9 May 2019 15:04:38 +0000 (11:04 -0400)]
x86/microcode: Don't return error if microcode update is not needed
Commit 347b54683 ("x86/microcode: Synchronize late microcode loading")
incorrectly returns -EINVAL error on all request_microcode_fw() failures
in reload_store(). In fact, when update is not needed or if there is no
microcode to load we don't need to treat this as an error.
Konrad Rzeszutek Wilk [Fri, 3 May 2019 02:01:38 +0000 (22:01 -0400)]
x86/mds: Add empty commit for CVE-2019-11091
The fixes for MDS also cover this CVE, which is described as:
"Microarchitectural Data Sampling Uncacheable Memory (MDSUM): Uncacheable
memory on some microprocessors utilizing speculative execution may allow
an authenticated user to potentially enable information disclosure
via a side channel with local access"
Boris Ostrovsky [Wed, 8 May 2019 18:50:39 +0000 (14:50 -0400)]
x86/microcode: Add loader version file in debugfs
We want to be able to find out whether the late microcode loader is using a
"safe" method where the system is in stop_machine() --- i.e. all cores
are pinned in the kernel with interrupts disabled. This is especially
important for core siblings --- if one thread is loading microcode while
the other is executing instructions that are being patched then bad
things may happen, including MCEs.
Presence of this file indicates that we are all good. We will also
provide a version value of "1".
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Borislav Petkov [Wed, 14 Mar 2018 18:36:15 +0000 (19:36 +0100)]
x86/microcode: Fix CPU synchronization routine
Emanuel reported an issue with a hang during microcode update because my
dumb idea to use one atomic synchronization variable for both rendezvous
- before and after update - was simply bollocks:
microcode: microcode_reload_late: late_cpus: 4
microcode: __reload_late: cpu 2 entered
microcode: __reload_late: cpu 1 entered
microcode: __reload_late: cpu 3 entered
microcode: __reload_late: cpu 0 entered
microcode: __reload_late: cpu 1 left
microcode: Timeout while waiting for CPUs rendezvous, remaining: 1
CPU1 above would finish, leave and the others will still spin waiting for
it to join.
So do two synchronization atomics instead, which makes the code a lot more
straightforward.
Also, since the update is serialized and it also takes quite some time per
microcode engine, increase the exit timeout by the number of CPUs on the
system.
That's ok because the moment all CPUs are done, that timeout will be cut
short.
Furthermore, panic when some of the CPUs timeout when returning from a
microcode update: we can't allow a system with not all cores updated.
Also, as an optimization, do not do the exit sync if microcode wasn't
updated.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Borislav Petkov [Wed, 8 May 2019 15:10:41 +0000 (11:10 -0400)]
x86/microcode: Synchronize late microcode loading
Original idea by Ashok, completely rewritten by Borislav.
Before you read any further: the early loading method is still the
preferred one and you should always do that. The following patch is
improving the late loading mechanism for long running jobs and cloud use
cases.
Gather all cores and serialize the microcode update on them by doing it
one-by-one to make the late update process as reliable as possible and
avoid potential issues caused by the microcode update.
[ Borislav: Rewrite completely. ]
Co-developed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Ashok Raj <ashok.raj@intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com> Link: https://lkml.kernel.org/r/20180228102846.13447-8-bp@alien8.de
(cherry picked from commit a5321aec6412b20b5ad15db2d6b916c05349dbff)
Conflicts --- quite a few. Notable ones:
* We don't have microcode cache and so call request_microcode_fw() for
each CPU
* No need to get/put_online_cpus() --- they are part of stop_machine()
* No stop_machine_cpuslocked() in uek4 but uek4's version of
stop_machine() prevents CPU hotplug.
* uek4 has fewer result codes for microcode operations and thus error
handling is slightly different.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Configure x86 runtime CPU speculation bug mitigations in accordance with
the 'mitigations=' cmdline option. This affects Meltdown, Spectre v2,
Speculative Store Bypass, and L1TF.
The default behavior is unchanged.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/6616d0ae169308516cfdf5216bedd169f8a8291b.1555085500.git.jpoimboe@redhat.com
(cherry picked from commit aaa95f2f1112dd4ec31ae13c4cf877dc7c7fcbc8)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
arch/x86/kernel/cpu/bugs.c
arch/x86/mm/pti.c
Documentation/admin-guide/kernel-parameters.txt: different location
arch/x86/kernel/cpu/bugs.c: different name (bugs_64.c). Also we have different logic in nospectre_v2.
arch/x86/mm/pti.c: different location for mitigation arch/x86/mm/kaiser.c.
Keeping track of the number of mitigations for all the CPU speculation
bugs has become overwhelming for many users. It's getting more and more
complicated to decide which mitigations are needed for a given
architecture. Complicating matters is the fact that each arch tends to
have its own custom way to mitigate the same vulnerability.
Most users fall into a few basic categories:
a) they want all mitigations off;
b) they want all reasonable mitigations on, with SMT enabled even if
it's vulnerable; or
c) they want all reasonable mitigations on, with SMT disabled if
vulnerable.
Define a set of curated, arch-independent options, each of which is an
aggregation of existing options:
- mitigations=off: Disable all mitigations.
- mitigations=auto: [default] Enable all the default mitigations, but
leave SMT enabled, even if it's vulnerable.
- mitigations=auto,nosmt: Enable all the default mitigations, disabling
SMT if needed by a mitigation.
Currently, these options are placeholders which don't actually do
anything. They will be fleshed out in upcoming patches.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jiri Kosina <jkosina@suse.cz> (on x86) Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: "H . Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jiri Kosina <jikos@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jon Masters <jcm@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: linuxppc-dev@lists.ozlabs.org Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux-s390@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-arch@vger.kernel.org Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steven Price <steven.price@arm.com> Cc: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/b07a8ef9b7c5055c3a4637c87d07c296d5016fe0.1555085500.git.jpoimboe@redhat.com
(cherry picked from commit 6cbbaa933b325234d6ffc93836d6b7c06dea7a56)
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
Different location: Documentation/kernel-parameters.txt
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
bugs.c vs bug_64.c in UEK4
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
bugs.c vs bugs_64.c on UEK4
Mihai Carabas [Fri, 12 Apr 2019 11:20:40 +0000 (14:20 +0300)]
x86/speculation/mds: update mds_mitigation to reflect debugfs configuration
- If we enable mds_user_clear, we set mds_mitigation to MDS_MITIGATION_FULL
or MDS_MITIGATION_VMWERV.
- When we disable mds_user_clear, we set mds_mitigation to
MDS_MITIGATION_IDLE if mds_idle_clear is enabled, otherwise MDS_MITIGATION_OFF.
- If we enable mds_idle_clear, we set mds_mitigation to MDS_MITIGATION_IDLE
only if mds_user_clear is disabled.
- When we disable mds_idle_clear, we set mds_mitigation to
MDS_MITIGATION_OFF if mds_user_clear is disabled.
Mihai Carabas [Mon, 8 Apr 2019 10:48:09 +0000 (13:48 +0300)]
x86/speculation/mds: fix microcode late loading
In the microcode late loading case we have to:
- clear the CPU bugs related to MDS to be re-evaluated
- add proper evaluation of the MDS state and enable mitigation if necessary.
If the user has enforced off or idle mitigation, we keep it. Also if the
microcode fixes the MDS bug, mitigation will be turned off.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
bugs.c vs bugs_64.c: we do not have arch_smt_update. Squash the check in mds_select_mitigation.
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
Documentation/admin-guide/kernel-parameters.txt
arch/x86/kernel/cpu/bugs.c
bugs.c vs bugs_64.c: different boot command line parsing code
Documentation/admin-guide/kernel-parameters.txt vs Documentation/kernel-parameters.txt
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 3b5d5994ef174daf5f77ba3f101cd879cf0c5fd6)
Move L1TF to a separate directory so the MDS stuff can be added at the
side. Otherwise all the hardware vulnerabilities have their own top level
entry. Should have done that right away.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 19049ee5b2543696dd3cb164a4c4e566984f9615)
In virtualized environments it can happen that the host has the microcode
update which utilizes the VERW instruction to clear CPU buffers, but the
hypervisor is not yet updated to expose the X86_FEATURE_MD_CLEAR CPUID bit
to guests.
Introduce an internal mitigation mode VMWERV which enables the invocation
of the CPU buffer clearing even if X86_FEATURE_MD_CLEAR is not set. If the
system has no updated microcode this results in a pointless execution of
the VERW instruction wasting a few CPU cycles. If the microcode is updated,
but not exposed to a guest then the CPU buffers will be cleared.
That said: Virtual Machines Will Eventually Receive Vaccine
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit a2227b3f734fad5ace7c99103d6a7bc020c193fd)
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
The changes to arch/x86/kernel/cpu/bugs.c instead need to be made to
arch/x86/kernel/cpu/bugs_64.c.
Add the sysfs reporting file for MDS. It exposes the vulnerability and
mitigation state similar to the existing files for the other speculative
hardware vulnerabilities.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit db366061fff1f76407cb5d1b0975fcc381400cc3)
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
The changes to arch/x86/kernel/cpu/bugs.c instead need to be made to
arch/x86/kernel/cpu/bugs_64.c.
X86_HYPER_NATIVE doesn't exist so just leave that change out.
sched_smt_active() does not exist; use cpu_smp_control instead.
hypervisor_is_type() is replaced with cpu_has_hypervisor.
Now that the mitigations are in place, add a command line parameter to
control the mitigation, a mitigation selector function and a SMT update
mechanism.
This is the minimal, straightforward initial implementation which just
provides an always on/off mode. The command line parameter is:
mds=[full|off]
This is consistent with the existing mitigations for other speculative
hardware vulnerabilities.
The idle invocation is dynamically updated according to the SMT state of
the system similar to the dynamic update of the STIBP mitigation. The idle
mitigation is limited to CPUs which are only affected by MSBDS and not any
other variant, because the other variants cannot be mitigated on SMT
enabled systems.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>
(cherry picked from commit 4cad86e4abd472f637038e0ad70a70d0d7333f83)
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Conflicts:
The changes to arch/x86/kernel/cpu/bugs.c instead need to be made to
arch/x86/kernel/cpu/bugs_64.c.