Matt Delco [Thu, 12 Sep 2019 23:07:58 +0000 (16:07 -0700)]
KVM: coalesced_mmio: add bounds checking
The first/last indexes are typically shared with a user app.
The app can change the 'last' index that the kernel uses
to store the next result. This change sanity checks the index
before using it for writing to a potentially arbitrary address.
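The check described above can be sketched in plain C. This is a minimal user-space model, not the KVM code: the struct ring type, RING_MAX and ring_store() are hypothetical names chosen for illustration, and the ring size is arbitrary.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-in for a ring shared with user space. */
#define RING_MAX 8u

struct ring {
	uint32_t first;            /* consumer index, advanced by user space */
	uint32_t last;             /* producer index, also writable by user space */
	uint64_t slot[RING_MAX];
};

/* Store one value; return false instead of writing out of bounds. */
static bool ring_store(struct ring *r, uint64_t val)
{
	/* Read the shared index once so the check and the use agree. */
	uint32_t insert = *(volatile uint32_t *)&r->last;

	/* Reject an index that user space moved outside the array. */
	if (insert >= RING_MAX)
		return false;

	/* The ring is full when advancing 'last' would reach 'first'. */
	if ((insert + 1) % RING_MAX == r->first % RING_MAX)
		return false;

	r->slot[insert] = val;
	r->last = (insert + 1) % RING_MAX;
	return true;
}
```

Copying the index into a local before validating it matters as much as the comparison itself: otherwise the other side could change the index between the check and the store.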
Signed-off-by: Matt Delco <delco@chromium.org>
Orabug: 30318042
CVE: CVE-2019-14821
[setje: This patch came to UEK while still under embargo.] Signed-off-by: Jan Setje-Eilers <jan.setjeeilers@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Quinn Tran <qutran@marvell.com> Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Himanshu Madhani <hmadhani@marvell.com> Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
xen/swiotlb: remember having called xen_create_contiguous_region()
Instead of always calling xen_destroy_contiguous_region() in case the
memory is DMA-able for the used device, do so only in case it has been
made DMA-able via xen_create_contiguous_region() before.
This will avoid a lot of xen_destroy_contiguous_region() calls for
64-bit capable devices.
As the memory in question is owned by swiotlb-xen the PG_owner_priv_1
flag of the first allocated page can be used for remembering.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit b877ac9815a8fe7e5f6d7fdde3dc34652408840a)
Orabug: 30141778 Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
PF_NO_COMPOUND is not used for PAGEFLAG() in uek4
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
range_straddles_page_boundary() is open coding several macros from
include/xen/page.h. Use those instead. Additionally there is no need
to have check_pages_physically_contiguous() as a separate function as
it is used only once, so merge it into range_straddles_page_boundary().
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit bf70726668c6116aa4976e0cc87f470be6268a2f)
Orabug: 30141778 Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
xen/swiotlb: fix condition for calling xen_destroy_contiguous_region()
The condition in xen_swiotlb_free_coherent() for deciding whether to
call xen_destroy_contiguous_region() is wrong: in case the region to
be freed is not contiguous calling xen_destroy_contiguous_region() is
the wrong thing to do: it would result in inconsistent mappings of
multiple PFNs to the same MFN. This will lead to various strange
crashes or data corruption.
Instead of calling xen_destroy_contiguous_region() in that case a
warning should be issued as that situation should never occur.
Cc: stable@vger.kernel.org Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 50f6393f9654c561df4cdcf8e6cfba7260143601)
Orabug: 30141778 Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Larry Bassel <larry.bassel@oracle.com> Reviewed-by: John Donnelly <john.p.donnelly@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ming Lei [Wed, 22 Mar 2017 02:14:43 +0000 (10:14 +0800)]
blk-mq: don't complete un-started request in timeout handler
When iterating busy requests in timeout handler,
if the STARTED flag of one request isn't set, that means
the request is being processed in block layer or driver, and
isn't submitted to hardware yet.
In the current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This is
wrong, and can cause rq corruption or double allocation[1][2]
when doing I/O while removing and resetting an NVMe device at the same time.
This patch fixes several issues reported by Yi Zhang.
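A rough model of the intended behaviour, assuming a simplified request type: struct req and check_expired() are illustrative names, not the blk-mq API. Requests that have not been started are skipped by the timeout scan and are never completed from it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical request state. */
struct req {
	bool started;            /* set once the request is issued to hardware */
	unsigned long deadline;  /* time at which the request expires */
	bool timed_out;
};

/* Scan busy requests; return how many were marked as timed out. */
static size_t check_expired(struct req *rq, size_t n, unsigned long now)
{
	size_t hit = 0;

	for (size_t i = 0; i < n; i++) {
		if (!rq[i].started)
			continue;        /* still owned by block layer/driver: leave it alone */
		if (now >= rq[i].deadline) {
			rq[i].timed_out = true;
			hit++;
		}
	}
	return hit;
}
```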
Cc: stable@vger.kernel.org Reported-by: Yi Zhang <yizhan@redhat.com> Tested-by: Yi Zhang <yizhan@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 95a49603707d982b25d17c5b70e220a05556a2f9)
Orabug: 29903684 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com>
Conflicts:
block/blk-mq.c
Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Yuchung Cheng <ycheng@google.com> Cc: Yaogong Wang <wygivan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 76f0dcbb5ae1a7c3dbeec13dd98233b8e6b0b32a)
Orabug: 29997352 Signed-off-by: Jacob Wen <jian.w.wen@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/ipv4/tcp_input.c
tcp_drop was introduced by d0f2a1c0e4
Tong Chen [Wed, 14 Aug 2019 06:18:17 +0000 (14:18 +0800)]
mm: keep kabi compatibility of may_expand_vm() etc
One of the previous patches:
mm: rework virtual memory accounting
has modifications on the prototype of functions like may_expand_vm() and the
shared_vm field of mm struct, which actually break the compatibility of kabi.
In order to keep kabi compatibility intact, this patch changed the related
function prototype back and renamed the data_vm back to shared_vm.
Orabug: 30145754 Signed-off-by: Tong Chen <tong.c.chen@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Konstantin Khlebnikov [Fri, 20 May 2016 23:57:45 +0000 (16:57 -0700)]
mm: enable RLIMIT_DATA by default with workaround for valgrind
Since commit 84638335900f ("mm: rework virtual memory accounting")
RLIMIT_DATA limits both brk() and private mmap(), but this is disabled by
default because of incompatibility with older versions of valgrind.
Valgrind always sets the limit to zero and fails if RLIMIT_DATA is enabled.
Fortunately it changes only rlim_cur and keeps rlim_max for reverting
limit back when needed.
This patch checks current usage also against rlim_max if rlim_cur is
zero. This is safe because a task can raise rlim_cur up to
rlim_max anyway. The size of brk is still checked against rlim_cur, so this part
is completely compatible - zero rlim_cur forbids brk() but allows
private mmap().
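The rule can be sketched as follows. This is a user-space model under stated assumptions: struct data_limit, data_mapping_allowed() and brk_allowed() are hypothetical helpers, and sizes are in bytes.

```c
#include <stdbool.h>

/* Hypothetical model of a task's RLIMIT_DATA pair, in bytes. */
struct data_limit {
	unsigned long cur;   /* soft limit (rlim_cur) */
	unsigned long max;   /* hard limit (rlim_max) */
};

/* May a private data mapping grow total data usage to 'new_total' bytes? */
static bool data_mapping_allowed(const struct data_limit *lim,
				 unsigned long new_total)
{
	if (new_total <= lim->cur)
		return true;
	/* Valgrind workaround: a zero soft limit falls back to the hard limit. */
	if (lim->cur == 0 && new_total <= lim->max)
		return true;
	return false;
}

/* brk() is still checked against the soft limit only. */
static bool brk_allowed(const struct data_limit *lim, unsigned long new_brk_size)
{
	return new_brk_size <= lim->cur;
}
```

With cur = 0 and max = 1 MiB, a 4 KiB private mapping passes while a 4 KiB brk does not, which matches the "forbids brk() but allows private mmap()" statement.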
Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754 Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mmap.c
(cherry picked from commit f4fcd55841fc9e46daac553b39361572453c2b88) Signed-off-by: Tong Chen <tong.c.chen@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Konstantin Khlebnikov [Wed, 3 Feb 2016 00:57:43 +0000 (16:57 -0800)]
mm: warn about VmData over RLIMIT_DATA
This patch provides a way of working around a slight regression
introduced by commit 84638335900f ("mm: rework virtual memory
accounting").
Before that commit RLIMIT_DATA had control only over the size of the brk
region. But that change has caused problems with all existing versions
of valgrind, because it sets RLIMIT_DATA to zero.
This patch fixes the rlimit check (the limit is actually in bytes, not pages) and
by default turns it into a warning which is printed at the first VmData misuse:
"mmap: top (795): VmData 516096 exceed data ulimit 512000. Will be forbidden soon."
Behavior is controlled by the boot param ignore_rlimit_data=y/n and by sysfs
/sys/module/kernel/parameters/ignore_rlimit_data. For now it is set to "y".
[akpm@linux-foundation.org: tweak kernel-parameters.txt text] Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Kees Cook <keescook@google.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754
Fixed current pointer dereferencing error
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mmap.c
(cherry picked from commit d977d56ce5b3e8842236f2f9e7483d4914c9592e) Signed-off-by: Tong Chen <tong.c.chen@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Konstantin Khlebnikov [Thu, 14 Jan 2016 23:22:07 +0000 (15:22 -0800)]
mm: rework virtual memory accounting
While inspecting some unclear code inside the prctl(PR_SET_MM_MEM) call (which
tests the RLIMIT_DATA value to figure out if we're allowed to assign
new @start_brk, @brk, @start_data, @end_data from mm_struct), it was
noticed that RLIMIT_DATA, in the form it's implemented now, doesn't do
anything useful because most user-space libraries use the mmap() syscall
for dynamic memory allocations.
Linus suggested converting the RLIMIT_DATA rlimit into something suitable
for anonymous memory accounting. But in this patch we go further, and
the changes are bundled together as:
* keep vma counting if CONFIG_PROC_FS=n, will be used for limits
* replace mm->shared_vm with better defined mm->data_vm
* account anonymous executable areas as executable
* account file-backed growsdown/up areas as stack
* drop struct file* argument from vm_stat_account
* enforce RLIMIT_DATA for size of data areas
This way code looks cleaner: now code/stack/data classification depends
only on vm_flags state:
The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
"shared", but that might be a strange beast like a readonly-private or VM_IO
area.
- RLIMIT_AS limits whole address space "VmSize"
- RLIMIT_STACK limits stack "VmStk" (but each vma individually)
- RLIMIT_DATA now limits "VmData"
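The flag-to-category mapping that the text refers to is not reproduced in this log, so the sketch below encodes one plausible reading of the bullets above and should be treated as an assumption: executable and not writable counts as code, growsup/growsdown counts as stack, and writable private counts as data. The F_* bits, struct vm_counters and account() are illustrative names; the kernel's real VM_* values are not used.

```c
/* Illustrative flag bits; the kernel's real VM_* values are not assumed here. */
enum {
	F_WRITE     = 1 << 0,
	F_EXEC      = 1 << 1,
	F_SHARED    = 1 << 2,
	F_GROWSDOWN = 1 << 3,
	F_GROWSUP   = 1 << 4,
};

/* Per-mm page counters, loosely mirroring VmSize/VmExe/VmStk/VmData. */
struct vm_counters {
	long total, exec, stack, data;
};

/* Account npages to exactly one category based only on the flags. */
static void account(struct vm_counters *c, unsigned flags, long npages)
{
	c->total += npages;

	if ((flags & (F_EXEC | F_WRITE)) == F_EXEC)
		c->exec += npages;               /* executable, not writable */
	else if (flags & (F_GROWSDOWN | F_GROWSUP))
		c->stack += npages;              /* growable area */
	else if ((flags & (F_WRITE | F_SHARED)) == F_WRITE)
		c->data += npages;               /* writable and private */
	/* anything else is the unclassified "rest" */
}
```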
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Willy Tarreau <w@1wt.eu> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Kees Cook <keescook@google.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754 Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/proc/task_mmu.c
mm/mprotect.c
(cherry picked from commit 84638335900f1995495838fe1bd4870c43ec1f67) Signed-off-by: Tong Chen <tong.c.chen@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
(cherry picked from commit 09357814778a38a5ab2d031cba6c9e9fe090c849) Signed-off-by: Tong Chen <tong.c.chen@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Oleg Nesterov [Fri, 6 Nov 2015 02:48:14 +0000 (18:48 -0800)]
mm: fix the racy mm->locked_vm change in
"mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are
not safe; multiple threads using the same ->mm can do this at the same
time trying to expand different vmas under down_read(mmap_sem). This
means that one of the "locked_vm += grow" changes can be lost and we can
miss munlock_vma_pages_all() later.
Move this code into the caller(s) under mm->page_table_lock. All other
updates to ->locked_vm hold mmap_sem for writing.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754
(cherry picked from commit 87e8827b37c0c391d9915d0dc6a06c9b5f9cac65) Signed-off-by: Tong Chen <tong.c.chen@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ming Lei [Thu, 5 Sep 2019 07:55:27 +0000 (15:55 +0800)]
block: loop: fix another reread part failure
loop_clr_fd() can be run piggyback with lo_release(), and
under this situation, reread partition may always fail because
bd_mutex has been held already.
This patch detects the situation by the reference count, and
calls __blkdev_reread_part() to avoid acquiring the lock again.
In the meantime, this patch switches to new kernel APIs
of blkdev_reread_part() and __blkdev_reread_part().
Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry-picked from commit 06f0e9e68c0d81c7d822a405f6e35686a711c1fe)
Orabug: 30264603 Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ming Lei [Thu, 5 Sep 2019 07:45:49 +0000 (15:45 +0800)]
block: loop: don't hold lo_ctl_mutex in lo_open
The lo_ctl_mutex is held for running all ioctl handlers, and
in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called for
rereading partitions, which requires bd_mutex.
So it is easy to cause a failure because trylock(bd_mutex) may
fail inside blkdev_reread_part(), given the following lock context:
blkid or other application:
->open()
->mutex_lock(bd_mutex)
->lo_open()
->mutex_lock(lo_ctl_mutex)
This patch tries to eliminate the ABBA lock dependency by removing
lo_ctl_mutex in lo_open() with the following approach:
1) make lo_refcnt an atomic_t and avoid acquiring lo_ctl_mutex in lo_open():
- for open vs. add/del loop, there is no problem because of loop_index_mutex
- freeze request queue during clr_fd, so I/O can't come until
clearing fd is completed, like the effect of holding lo_ctl_mutex
in lo_open
- both open() and release() have been serialized by bd_mutex already
2) don't hold lo_ctl_mutex for decreasing/checking lo_refcnt in
lo_release(), then lo_ctl_mutex is only required for the last release.
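The reference-count part of this approach can be modelled with C11 atomics. This is a sketch only: struct loopdev, dev_open() and dev_release() are hypothetical names, and the queue freezing and bd_mutex serialization mentioned above are not modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device with a lock-free open count. */
struct loopdev {
	atomic_int refcnt;
};

/* open(): bump the count without taking the control mutex. */
static void dev_open(struct loopdev *lo)
{
	atomic_fetch_add(&lo->refcnt, 1);
}

/*
 * release(): drop the count without the control mutex.
 * Returns true only for the last release, the one case in which the
 * caller would still need to take the control mutex for teardown.
 */
static bool dev_release(struct loopdev *lo)
{
	return atomic_fetch_sub(&lo->refcnt, 1) == 1;
}
```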
Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry-picked from commit f8933667953e8e61bb6104f5ca88e32e85656a93)
Orabug: 30264603 Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Junxiao Bi [Wed, 10 Jul 2019 00:17:19 +0000 (17:17 -0700)]
dm bufio: fix deadlock with loop device
When thin-volume is built on loop device, if available memory is low,
the following deadlock can be triggered:
One process P1 allocates memory with GFP_FS flag, direct alloc fails,
memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
IO to complete in __try_evict_buffer().
But this IO may never complete if issued to an underlying loop device
that forwards it using direct-IO, which allocates memory using
GFP_KERNEL (see: do_blockdev_direct_IO()). If allocation fails, memory
reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
will be invoked, and since the mutex is already held by P1 the loop
thread will hang and IO will never complete, resulting in an ABBA
deadlock.
Cc: stable@vger.kernel.org Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
(cherry picked from commit bd293d071ffe65e645b4d8104f9d8fe15ea13862)
Orabug: 29964645 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Shuning Zhang <sunny.s.zhang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mikulas Patocka [Wed, 23 Nov 2016 21:52:01 +0000 (16:52 -0500)]
dm bufio: don't take the lock in dm_bufio_shrink_count
dm_bufio_shrink_count() is called from do_shrink_slab to find out how many
freeable objects there are. The reported value doesn't have to be precise,
so we don't need to take the dm-bufio lock.
Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
(cherry picked from commit d12067f428c037b4575aaeb2be00847fc214c24a)
Orabug: 29964645 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Shuning Zhang <sunny.s.zhang@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
aru kolappan [Tue, 13 Aug 2019 16:50:02 +0000 (09:50 -0700)]
rds: rds-info shows IPv4 address as '0.0.0.0'
When the rds-info command reads IP address from send/retransmit queue,
it gets IPv4 address as '0.0.0.0'. It is due to reading the address
from the wrong offset.
Signed-off-by: aru kolappan <aru.kolappan@oracle.com> Reviewed-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/dcache.c - Upstream did not have 4th arg NULL to d_walk().
Alejandro Jimenez [Thu, 29 Aug 2019 17:00:14 +0000 (13:00 -0400)]
retpoline: Move retpoline_mode_selected() out of .init.text section
Remove the __init macro from the definition of retpoline_mode_selected(),
since there are functions that call it after initialization. So far
a problem has not occurred because calls to retpoline_mode_selected()
are inlined by the compiler, but this could change in the future.
Ankur Arora [Wed, 14 Aug 2019 22:05:20 +0000 (18:05 -0400)]
xen-netback: use irqsave/irqrestore in xenvif_rx_dequeue()
xenvif_rx_action() acquires and releases queue->rx_lock via spin_lock_irq() and
spin_unlock_irq(). It in turn calls xenvif_rx_dequeue(), which
acquires and releases a different spinlock, queue->rx_queue.lock, also via
spin_lock_irq() and spin_unlock_irq(). The second set of calls is
problematic because it leads to irqs being enabled early:
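The nesting problem can be modelled with a single flag standing in for the CPU's interrupt-enable state. This is a toy model, not kernel code: the helper names are hypothetical and no real locking takes place.

```c
#include <stdbool.h>

/* Toy model: one flag stands in for the CPU's interrupt-enable state. */
static bool irqs_enabled = true;

static void lock_irq(void)   { irqs_enabled = false; }
static void unlock_irq(void) { irqs_enabled = true; }   /* unconditional enable */

/* Save the current state, then disable. */
static bool lock_irqsave(void)
{
	bool flags = irqs_enabled;

	irqs_enabled = false;
	return flags;
}

/* Restore whatever state was saved, enabled or not. */
static void unlock_irqrestore(bool flags)
{
	irqs_enabled = flags;
}

/* Inner section uses the _irq variants: returns the state seen by the outer section. */
static bool outer_state_with_inner_irq(void)
{
	lock_irq();                /* outer lock: interrupts off */
	lock_irq();                /* inner lock */
	unlock_irq();              /* inner unlock re-enables too early */
	bool seen = irqs_enabled;
	unlock_irq();              /* outer unlock */
	return seen;
}

/* Inner section uses irqsave/irqrestore: the outer section stays protected. */
static bool outer_state_with_inner_irqsave(void)
{
	lock_irq();                /* outer lock: interrupts off */
	bool flags = lock_irqsave();
	unlock_irqrestore(flags);  /* restores "disabled" */
	bool seen = irqs_enabled;
	unlock_irq();              /* outer unlock */
	return seen;
}
```

In this model the first helper observes interrupts enabled while the outer lock is still held, and the second does not.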
Orabug: 30223112 Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Olga Kornievskaia [Wed, 29 May 2019 14:46:00 +0000 (10:46 -0400)]
SUNRPC fix regression in umount of a secure mount
If call_status returns ENOTCONN, we need to re-establish the connection
state after. Otherwise the client goes into an infinite loop of call_encode,
call_transmit, call_status (ENOTCONN), call_encode.
Fixes: c8485e4d63 ("SUNRPC: Handle ECONNREFUSED correctly in xprt_transmit()") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Cc: stable@vger.kernel.org # v2.6.29+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
(cherry picked from commit ec6017d9035986a36de064f48a63245930bfad6f)
Orabug: 29926734 Signed-off-by: Calum Mackay <calum.mackay@oracle.com> Reviewed-by: Bill Baker <bill.baker@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Junxiao Bi [Mon, 5 Aug 2019 17:24:34 +0000 (10:24 -0700)]
block: fix RO partition with RW disk
When md raid1 is used with imsm metadata, during the boot stage
the raid device is first set to read-only, then mdmon sets
it read-write later. When there are partitions on this device,
the following race can leave some partitions read-only and make them fail to mount.
CPU 1:                                              CPU 2:
add_partition()                                     set_disk_ro() //set disk RW
  //disk was RO, so partition set to RO
  p->policy = get_disk_ro(disk);
                                                      if (disk->part0.policy != flag) {
                                                          set_disk_ro_uevent(disk, flag);
                                                          // disk set to RW
                                                          disk->part0.policy = flag;
                                                      }
                                                      // set all existing partitions to RW
                                                      while ((part = disk_part_iter_next(&piter)))
                                                          part->policy = flag;
  // this part was not yet added, so it was still RO
  rcu_assign_pointer(ptbl->part[partno], p);
Move RO status setting of partitions after they were added into partition
table and introduce a mutex to sync RO status between disk and partitions.
Alejandro Jimenez [Tue, 13 Aug 2019 15:40:58 +0000 (11:40 -0400)]
retpoline: Show correct spectrev2 mitigation after loading non-retpoline module
If a loaded kernel module is not built with retpoline capabilities,
the kernel is tainted and sysfs reports the system as "Vulnerable"
to spectre v2, even though the retpoline mitigation is still enabled.
Change the message displayed in sysfs to report when a non-retpoline
module has been loaded using the new format:
Mitigation: Full generic retpoline (non-retpoline module(s) has been loaded), IBRS_FW, IBPB
This enables more precise tracking of the security status by
differentiating the cases where no spectre v2 mitigation is
available (Vulnerable), and when retpoline is available/active but
a vulnerable module has introduced a potential attack vector.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Julien Gomes [Wed, 25 Oct 2017 18:50:50 +0000 (11:50 -0700)]
tun: allow positive return values on dev_get_valid_name() call
If the name argument of dev_get_valid_name() contains "%d", it will try
to assign it a unit number in __dev_alloc_name() and return either the
unit number (>= 0) or an error code (< 0).
Treating positive values as errors prevents the creation of tun devices
that rely on this mechanism, therefore we should only consider negative values
as errors here.
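A minimal sketch of the return-value convention, assuming a stand-in for the naming helper: fake_get_valid_name() and create_device() are hypothetical and only mimic the "unit number or negative errno" contract described above.

```c
#include <errno.h>

/* Hypothetical stand-in: returns a unit number (>= 0) or a negative errno. */
static int fake_get_valid_name(const char *pattern, int next_unit)
{
	if (!pattern || !*pattern)
		return -EINVAL;
	return next_unit;
}

/* Returns 0 on success, negative errno on failure. */
static int create_device(const char *pattern, int next_unit)
{
	int err = fake_get_valid_name(pattern, next_unit);

	if (err < 0)      /* only negative values are errors */
		return err;
	return 0;         /* a positive unit number means success */
}
```

Testing `if (err)` instead of `if (err < 0)` would reject every unit number other than 0, which is the failure mode the commit message describes.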
Signed-off-by: Julien Gomes <julien@arista.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 30085611
(cherry picked from commit 5c25f65fd1e42685f7ccd80e0621829c105785d9) Signed-off-by: Jacob Wen <jian.w.wen@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Juergen Gross [Wed, 19 Jun 2019 09:00:56 +0000 (11:00 +0200)]
xen: let alloc_xenballooned_pages() fail if not enough memory free
Instead of trying to allocate pages with GFP_USER in
add_ballooned_pages() check the available free memory via
si_mem_available(). GFP_USER is far less limiting memory exhaustion
than the test via si_mem_available().
This will avoid dom0 running out of memory due to excessive foreign
page mappings especially on ARM and on x86 in PVH mode, as those don't
have a pre-ballooned area which can be used for foreign mappings.
As the normal ballooning suffers from the same problem don't balloon
down more than si_mem_available() pages in one iteration. At the same
time limit the default maximum number of retries.
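The two checks can be sketched as pure functions of an "available pages" estimate. This is an illustration under assumptions: alloc_ballooned() and balloon_step() are hypothetical helpers, and the estimate is passed in as a parameter rather than obtained from si_mem_available().

```c
#include <errno.h>

/* Fail up front when fewer pages are available than requested. */
static int alloc_ballooned(unsigned long nr_pages, unsigned long available)
{
	if (available < nr_pages)
		return -ENOMEM;
	return 0;
}

/* Pages to balloon down in one iteration: never more than are available. */
static unsigned long balloon_step(unsigned long wanted, unsigned long available)
{
	return wanted < available ? wanted : available;
}
```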
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Igor Redko [Thu, 17 Mar 2016 21:19:05 +0000 (14:19 -0700)]
mm/page_alloc.c: calculate 'available' memory in a separate function
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.
It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap. This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.
This patch (of 2):
Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.
In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).
Signed-off-by: Igor Redko <redkoi@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d02bd27bd33dd7e8d22594cd568b81be0cb584cd)
Need this to provide si_mem_available() for the next patch
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/proc/meminfo.c: we don't have commit 84ad5802a33a ("proc:
meminfo: estimate available memory more conservatively") and
even though it looks reasonable and simple I didn't want to
include it because this would change a user-visible attribute.
The GTCO tablet input driver configures itself from an HID report sent
via USB during the initial enumeration process. Some debugging messages
are generated during the parsing. A debugging message indentation
counter is not bounds checked, leading to the ability for a specially
crafted HID report to cause '-' and null bytes to be written past the end
of the indentation array. As long as the kernel has CONFIG_DYNAMIC_DEBUG
enabled, this code will not be optimized out. This was discovered
during code review after a previous syzkaller bug was found in this
driver.
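The missing bounds check can be illustrated with a small indentation tracker. This is a sketch, not the driver code: MAX_LEVELS, struct indent_state, indent_push() and indent_pop() are hypothetical, and the limit value is arbitrary.

```c
#include <stdbool.h>

#define MAX_LEVELS 10   /* illustrative limit, not taken from the driver */

struct indent_state {
	int  depth;
	char str[MAX_LEVELS + 1];   /* room for MAX_LEVELS dashes plus NUL */
};

/* Enter one nesting level; refuse rather than write past the buffer. */
static bool indent_push(struct indent_state *s)
{
	if (s->depth >= MAX_LEVELS)
		return false;
	s->str[s->depth++] = '-';
	s->str[s->depth] = '\0';
	return true;
}

/* Leave one nesting level; refuse to go below zero. */
static bool indent_pop(struct indent_state *s)
{
	if (s->depth <= 0)
		return false;
	s->str[--s->depth] = '\0';
	return true;
}
```

Because the depth is driven by device-supplied data, both directions need a check: an unbalanced "end" must not underflow the index any more than extra "start" items may overflow it.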
Signed-off-by: Grant Hernandez <granthernandez@google.com> Cc: stable@vger.kernel.org Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
(cherry picked from commit 2a017fd82c5402b3c8df5e3d6e5165d9e6147dc1)
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alexander Burmashev [Thu, 11 Jul 2019 15:36:47 +0000 (08:36 -0700)]
Documentation/Docbook/Makefile: process xml files in parallel, based on nproc --all value
This commit introduces a few simple changes that keep the doc compilation itself serial
(the docbook documentation Makefile does not support parallelisation), but allow parallel
production of xml files. Compression of man files now also happens in parallel.
The maximum number of threads is limited to the nproc --all value.
Signed-off-by: Alex Burmashev <alexander.burmashev@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The commit a5ee13ea4ae4 ("rds: IB: fix returned value not set error")
causes this problem. When compiling, the following warning appears.
"
net/rds/ib.c: In function rds_ib_do_failover
net/rds/ib.c:1386: warning: unused variable ret
"
Fixes: a5ee13ea4ae4 ("rds: IB: fix returned value not set error") Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
xen-netback: stop netif TX queue on guest queuing failure
On a failure to queue a TX skb to the guest's RX io_ring, we return
NETDEV_TX_BUSY without stopping the netif-queue. If the guest's io_ring
is full, this can lead to unbounded retry.
This fix stops the queue when this condition is hit and restarts it once
the guest creates space in its RX io_ring. This also hardens against
malicious guests where the guest netfront can cause DoS on the host by
just not putting responses on the RX io_ring.
The guest can send an rx_interrupt even before the vif is fully
provisioned, so do this only after the vif moves to VIF_STATE_CONNECTED.
Junxiao Bi [Mon, 22 Jul 2019 16:15:24 +0000 (09:15 -0700)]
scsi: megaraid_sas: fix panic on loading firmware crashdump
While loading the fw crashdump in fw_crash_buffer_show(), the bytes left
in the current dma chunk were not checked; if the copy size exceeds them, the
out-of-bounds access causes a kernel panic.
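The clamp can be sketched as a pure function. This is an illustration, not the driver's code: copy_len() and its parameters are hypothetical, with 'total' the size of the whole dump, 'chunk' the size of one DMA chunk, and 'offset' the read position.

```c
#include <stddef.h>

/* Bytes that may be copied from 'offset' without leaving the current chunk. */
static size_t copy_len(size_t total, size_t chunk, size_t offset, size_t want)
{
	if (chunk == 0 || offset >= total)
		return 0;

	size_t left_total = total - offset;             /* bytes left in the dump */
	size_t left_chunk = chunk - (offset % chunk);   /* bytes left in this chunk */
	size_t n = want;

	if (n > left_total)
		n = left_total;
	if (n > left_chunk)
		n = left_chunk;
	return n;
}
```

A reader then advances by the returned length and calls again, so a request that straddles a chunk boundary is served in two steps instead of overrunning the first chunk.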
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Acked-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 3b5f307ef3cb5022bfe3c8ca5b8f2114d5bf6c29)
Orabug: 29993112 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: John Donnelly <John.p.donnnelly@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Thomas Gleixner [Wed, 17 Jul 2019 19:18:59 +0000 (21:18 +0200)]
x86/speculation: Exclude ATOMs from speculation through SWAPGS
Intel provided the following information:
On all current Atom processors, instructions that use a segment register
value (e.g. a load or store) will not speculatively execute before the
last writer of that segment retires. Thus they will not use a
speculatively written segment value.
That means on ATOMs there is no speculation through SWAPGS, so the SWAPGS
entry paths can be excluded from the extra LFENCE if PTI is disabled.
Create a separate bug flag for the through SWAPGS speculation and mark all
out-of-order ATOMs and AMD/HYGON CPUs as not affected. The in-order ATOMs
are excluded from the whole mitigation mess anyway.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
(cherry picked from commit f36cf386e3fec258a341d446915862eded3e13d8)
Orabug: 29967571
CVE: CVE-2019-1125
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The previous commit added macro calls in the entry code which mitigate the
Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are
enabled. Enable those features where applicable.
The mitigations may be disabled with "nospectre_v1" or "mitigations=off".
There are different features which can affect the risk of attack:
- When FSGSBASE is enabled, unprivileged users are able to place any
value in GS, using the wrgsbase instruction. This means they can
write a GS value which points to any value in kernel space, which can
be useful with the following gadget in an interrupt/exception/NMI
handler:
if (coming from user space)
swapgs
mov %gs:<percpu_offset>, %reg1
// dependent load or store based on the value of %reg1
// for example: mov (%reg1), %reg2
If an interrupt is coming from user space, and the entry code
speculatively skips the swapgs (due to user branch mistraining), it
may speculatively execute the GS-based load and a subsequent dependent
load or store, exposing the kernel data to an L1 side channel leak.
Note that, on Intel, a similar attack exists in the above gadget when
coming from kernel space, if the swapgs gets speculatively executed to
switch back to the user GS. On AMD, this variant isn't possible
because swapgs is serializing with respect to future GS-based
accesses.
NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case
doesn't exist quite yet.
- When FSGSBASE is disabled, the issue is mitigated somewhat because
unprivileged users must use prctl(ARCH_SET_GS) to set GS, which
restricts GS values to user space addresses only. That means the
gadget would need an additional step, since the target kernel address
needs to be read from user space first. Something like:
if (coming from user space)
swapgs
mov %gs:<percpu_offset>, %reg1
mov (%reg1), %reg2
// dependent load or store based on the value of %reg2
// for example: mov (%reg2), %reg3
It's difficult to audit for this gadget in all the handlers, so while
there are no known instances of it, it's entirely possible that it
exists somewhere (or could be introduced in the future). Without
tooling to analyze all such code paths, consider it vulnerable.
Effects of SMAP on the !FSGSBASE case:
- If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not
susceptible to Meltdown), the kernel is prevented from speculatively
reading user space memory, even L1 cached values. This effectively
disables the !FSGSBASE attack vector.
- If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP
still prevents the kernel from speculatively reading user space
memory. But it does *not* prevent the kernel from reading the
user value from L1, if it has already been cached. This is probably
only a small hurdle for an attacker to overcome.
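The cases above can be read as a small decision table. The following is a minimal userspace sketch of that table, not the kernel's code: the struct, the flag names and both functions are hypothetical stand-ins, and it assumes FSGSBASE is disabled (as the NOTE above says it still is), so SMAP combined with RDCL_NO is treated as closing the attack.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags standing in for the two X86_FEATURE_FENCE_SWAPGS_*
 * features named in the commit message. */
enum {
    FENCE_SWAPGS_USER   = 1 << 0,
    FENCE_SWAPGS_KERNEL = 1 << 1,
};

/* Hypothetical description of the CPU and boot configuration. */
struct cpu_config {
    bool mitigations_off; /* "nospectre_v1" or "mitigations=off" */
    bool smap;            /* SMAP is enabled */
    bool rdcl_no;         /* CPU reports RDCL_NO (not susceptible to Meltdown) */
    bool pti;             /* PTI enabled: user entry has a serializing CR3 write */
    bool swapgs_bug;      /* CPU can speculate through SWAPGS */
};

/* SMAP only stops speculative reads of user memory on RDCL_NO CPUs. */
static bool smap_blocks_speculation(const struct cpu_config *c)
{
    return c->smap && c->rdcl_no;
}

/* Return the set of lfence features this model would enable. */
static unsigned int select_swapgs_fences(const struct cpu_config *c)
{
    unsigned int fences = 0;

    assert(c);
    if (c->mitigations_off || smap_blocks_speculation(c))
        return 0;

    /* User entry: PTI's CR3 write already serializes, so fence only
     * when PTI is off and the CPU speculates through SWAPGS. */
    if (c->swapgs_bug && !c->pti)
        fences |= FENCE_SWAPGS_USER;

    /* Kernel entry (non-swapgs path) does not depend on PTI. */
    fences |= FENCE_SWAPGS_KERNEL;
    return fences;
}
```

In this model a CPU without SMAP and without PTI gets both fences, enabling PTI drops the user-entry fence, and SMAP plus RDCL_NO drops both.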
Thanks to Dave Hansen for contributing the speculative_smap() function.
Thanks to Andrew Cooper for providing the inside scoop on whether swapgs
is serializing on AMD.
[ tglx: Fixed the USER fence decision and polished the comment as suggested
by Dave Hansen ]
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com>
(cherry picked from commit a2059825986a1c8143fd6698774fa9d83733bb11)
Orabug: 29967571
CVE: CVE-2019-1125
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
kernel-parameters.txt The file location is different. Manual
edit to file.
bugs.c The changes are manually ported to
bugs_64.c
x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations
Spectre v1 isn't only about array bounds checks. It can affect any
conditional checks. The kernel entry code interrupt, exception, and NMI
handlers all have conditional swapgs checks. Those may be problematic in
the context of Spectre v1, as kernel code can speculatively run with a user
GS.
For example:
if (coming from user space)
swapgs
mov %gs:<percpu_offset>, %reg
mov (%reg), %reg1
When coming from user space, the CPU can speculatively skip the swapgs, and
then do a speculative percpu load using the user GS value. So the user can
speculatively force a read of any kernel value. If a gadget exists which
uses the percpu value as an address in another load/store, then the
contents of the kernel value may become visible via an L1 side channel
attack.
A similar attack exists when coming from kernel space. The CPU can
speculatively do the swapgs, causing the user GS to get used for the rest
of the speculative window.
The mitigation is similar to a traditional Spectre v1 mitigation, except:
a) index masking isn't possible; because the index (percpu offset)
isn't user-controlled; and
b) an lfence is needed in both the "from user" swapgs path and the
"from kernel" non-swapgs path (because of the two attacks described
above).
The user entry swapgs paths already have SWITCH_TO_KERNEL_CR3, which has a
CR3 write when PTI is enabled. Since CR3 writes are serializing, the
lfences can be skipped in those cases.
On the other hand, the kernel entry swapgs paths don't depend on PTI.
To avoid unnecessary lfences for the user entry case, create two separate
features for alternative patching:
X86_FEATURE_FENCE_SWAPGS_USER
X86_FEATURE_FENCE_SWAPGS_KERNEL
Use these features in entry code to patch in lfences where needed.
The features aren't enabled yet, so there's no functional change.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dave Hansen <dave.hansen@intel.com>
(cherry picked from commit 18ec54fdd6d18d92025af097cd042a75cf0ea24c)
Orabug: 29967571
CVE: CVE-2019-1125
Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
cpufeatures.h Changes implemented in cpufeature.h.
X86_FEATURE_FENCE_SWAPGS_USER and
X86_FEATURE_FENCE_SWAPGS_KERNEL were assigned
bits in word 11 after some cleaning up of
words 11 and 12 in the patchsets 1674293
and dfe8715. However, this breaks the
kABI for UEK kernels. Use the free bits
in word 2 instead.
calling.h Context
entry_64.S The code is significantly different and the
changes were done manually.
mlx4_core: change log_num_{qp,rdmarc} with scale_profile
When module parameter 'scale_profile' is set we
use different (than default) parameter values for
certain mlx4_core module parameters which define
some resource limits.
This changes 'log_num_qp' and 'log_num_rdmarc'
to lower values to fix some issues leading to
undesirable HCA resource usage. These led to
undesirable behavior in conjunction with current
round-robin allocation of QPs where some
long-lasting QPs "polluted" ICM memory chunks.
Cathy Avery [Tue, 19 Dec 2017 18:32:48 +0000 (13:32 -0500)]
scsi: storvsc: Fix scsi_cmd error assignments in storvsc_handle_error
When an I/O is returned with an srb_status of SRB_STATUS_INVALID_LUN
which has zero good_bytes it must be assigned an error. Otherwise the
I/O will be continuously requeued and will cause a deadlock in the case
where disks are being hot added and removed. sd_probe_async will wait
forever for its I/O to complete while holding scsi_sd_probe_domain.
Also returning the default error of DID_TARGET_FAILURE causes multipath
to not retry the I/O resulting in applications receiving I/O errors
before a failover can occur.
Signed-off-by: Cathy Avery <cavery@redhat.com> Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d1b8b2391c24751e44f618fcf86fb55d9a9247fd)
Orabug: 30052805 Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Håkon Bugge [Thu, 6 Jun 2019 12:00:05 +0000 (14:00 +0200)]
rds: ib: Fix dereference of conn when NULL and cleanup thereof
When rds_ib_cm_connect_complete() and rds_ib_flush_arp_entry() are
called from rds_rdma_cm_event_handler_cmn(), conn may be NULL and a
NULL pointer dereference will happen.
Also cleaned the code by performing the NULL check once.
Sriram Rajagopalan [Fri, 10 May 2019 23:28:06 +0000 (19:28 -0400)]
ext4: zero out the unused memory region in the extent tree block
This commit zeroes out the unused memory region in the buffer_head
corresponding to the extent metablock after writing the extent header
and the corresponding extent node entries.
This is done to prevent random uninitialized data from getting into
the filesystem when the extent block is synced.
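The idea can be sketched in userspace as follows. This is only an illustration: zero_unused_tail() is a hypothetical helper, and the real patch operates on the buffer_head data and derives the used size from the extent header and entries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero every byte of a block that lies past the bytes actually in use,
 * so stale memory contents are never written out with the block. */
static void zero_unused_tail(unsigned char *block, size_t block_size,
                             size_t used)
{
    assert(block && used <= block_size);
    memset(block + used, 0, block_size - used);
}
```

The used prefix (header plus entries) is left untouched; only the tail is cleared.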
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Gen Zhang [Fri, 24 May 2019 03:24:26 +0000 (11:24 +0800)]
ip_sockglue: Fix missing-check bug in ip_ra_control()
In function ip_ra_control(), the pointer new_ra is allocated via
kmalloc() and is used in the code that follows. However, kmalloc()
returns NULL when the allocation fails, so a NULL pointer dereference
may happen and crash the kernel.
Therefore, we should check the return value and handle the error.
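A minimal userspace sketch of the pattern, with malloc() standing in for kmalloc(). The struct, ra_add() and failing_alloc() are hypothetical and exist only so the failure path can be exercised; they are not the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical list node, loosely modelled on a router-alert entry. */
struct ra_entry {
    int sock_id;
    struct ra_entry *next;
};

/* Push a new entry on the list. The allocator is passed in so that
 * an allocation failure can be simulated. */
static int ra_add(struct ra_entry **head, int sock_id,
                  void *(*alloc)(size_t))
{
    struct ra_entry *new_ra;

    assert(head && alloc);
    new_ra = alloc(sizeof(*new_ra));
    if (!new_ra)            /* the missing check: bail out before any use */
        return -ENOMEM;

    new_ra->sock_id = sock_id;
    new_ra->next = *head;
    *head = new_ra;
    return 0;
}

/* Allocator that always fails, for testing the error path. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}
```

Calling ra_add() with failing_alloc returns -ENOMEM and leaves the list unchanged instead of dereferencing NULL.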
Signed-off-by: Gen Zhang <blackgod016574@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 425aa0e1d01513437668fa3d4a971168bbaa8515)
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Gen Zhang [Fri, 24 May 2019 03:19:46 +0000 (11:19 +0800)]
ipv6_sockglue: Fix a missing-check bug in ip6_ra_control()
In function ip6_ra_control(), the pointer new_ra is allocated a memory
space via kmalloc() and is used in the code that follows. However,
kmalloc() returns NULL when the allocation fails, so a NULL pointer
dereference may happen and crash the kernel.
Therefore, we should check the return value and handle the error.
Signed-off-by: Gen Zhang <blackgod016574@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 95baa60a0da80a0143e3ddd4d3725758b4513825)
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Mihai Carabas [Fri, 21 Jun 2019 13:33:30 +0000 (16:33 +0300)]
x86/microcode: fix x86_spec_ctrl_mask on late loading.
Fixes: d447e92c7d0a ("x86/microcode: add SPEC_CTRL_SSBD to x86_spec_ctrl_mask on late loading.")
This fixes potential hypervisor panic on guest writes to IA32_SPEC_CTRL after
microcode updates. Previously SSBD was made available if the CPU had the
bug. The correct approach is to make SSBD available if the flag is available.
In the function rds_ib_inc_free, when rds frags size is changed,
for example, firstly 4K frag is used, then 16K frag is used (for
example during uek2 to uek4 upgrade), this will make
"sg_total_lens(frag->f_sg) != ic->i_frag_sz" true.
In this case, only frag pages are freed while frag is not freed.
This will cause memory leak.
Hannes Reinecke [Fri, 30 Sep 2016 09:01:15 +0000 (11:01 +0200)]
scsi: libfc: Fixup disc_mutex handling in fcoe module
The list of attached 'rdata' remote port structures is RCU
protected, so there is no need to take the 'disc_mutex' when
traversing it.
Rather we should be using rcu_read_lock() and kref_get_unless_zero()
to validate the entries.
We do, however, need to take the disc_mutex when deleting an entry;
otherwise we risk clashes with list_add.
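The "take a reference only if the object is still alive" step can be modelled in userspace with C11 atomics. This is a sketch of the semantics of kref_get_unless_zero(), not the kernel implementation; struct refcounted and both helpers are hypothetical, and the RCU read-side section that keeps the memory valid during the lookup is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object. */
struct refcounted {
    atomic_uint refs;
};

/* Take a reference only if the count has not already dropped to zero.
 * A plain increment could resurrect an object that is being freed. */
static bool ref_get_unless_zero(struct refcounted *obj)
{
    unsigned int old = atomic_load(&obj->refs);

    while (old != 0) {
        /* On failure 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&obj->refs, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool ref_put(struct refcounted *obj)
{
    unsigned int prev = atomic_fetch_sub(&obj->refs, 1);

    assert(prev != 0); /* dropping a reference that was never held */
    return prev == 1;
}
```

A list walker would skip any entry for which ref_get_unless_zero() returns false, since that entry is already on its way out.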
Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 29511036
(cherry picked from commit a407c593398c886db4fa1fc5c6fec55e61187a09) Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Hannes Reinecke [Thu, 13 Oct 2016 13:10:41 +0000 (15:10 +0200)]
scsi: libfc: sanitize E_D_TOV and R_A_TOV setting in fcp
When setting the FCP timeout we need to ensure a lower boundary
for E_D_TOV and R_A_TOV, otherwise we'd be getting spurious I/O
issues due to the fcp timer firing too early.
Signed-off-by: Hannes Reinecke <hare@suse.com> Acked-by: Johannes Thumshirn <jth@kernel.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 29511036
(cherry picked from commit 76e72ad117812bb79abf647ac40ca6df1740b729) Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric W. Biederman [Thu, 6 Jul 2017 13:41:06 +0000 (08:41 -0500)]
proc: Fix proc_sys_prune_dcache to hold a sb reference
Andrei Vagin writes:
FYI: This bug has been reproduced on 4.11.7
> BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo} still in use (1) [unmount of proc proc]
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
> CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
> Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
> Workqueue: events proc_cleanup_work
> Call Trace:
> dump_stack+0x63/0x86
> __warn+0xcb/0xf0
> warn_slowpath_null+0x1d/0x20
> umount_check+0x6e/0x80
> d_walk+0xc6/0x270
> ? dentry_free+0x80/0x80
> do_one_tree+0x26/0x40
> shrink_dcache_for_umount+0x2d/0x90
> generic_shutdown_super+0x1f/0xf0
> kill_anon_super+0x12/0x20
> proc_kill_sb+0x40/0x50
> deactivate_locked_super+0x43/0x70
> deactivate_super+0x5a/0x60
> cleanup_mnt+0x3f/0x90
> mntput_no_expire+0x13b/0x190
> kern_unmount+0x3e/0x50
> pid_ns_release_proc+0x15/0x20
> proc_cleanup_work+0x15/0x20
> process_one_work+0x197/0x450
> worker_thread+0x4e/0x4a0
> kthread+0x109/0x140
> ? process_one_work+0x450/0x450
> ? kthread_park+0x90/0x90
> ret_from_fork+0x2c/0x40
> ---[ end trace e1c109611e5d0b41 ]---
> VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds. Have a nice day...
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: _raw_spin_lock+0xc/0x30
> PGD 0
Fix this by taking a reference to the super block in proc_sys_prune_dcache.
The superblock reference is the core of the fix however the sysctl_inodes
list is converted to a hlist so that hlist_del_init_rcu may be used. This
allows proc_sys_prune_dcache to remove inodes from the sysctl_inodes list,
while not causing problems for proc_sys_evict_inode if it later chooses to
remove the inode from the sysctl_inodes list. Removing inodes from the
sysctl_inodes list allows proc_sys_prune_dcache to have a progress
guarantee, while still being able to drop all locks. The fact that
head->unregistering is set in start_unregistering ensures that no more
inodes will be added to the sysctl_inodes list.
Previously the code did a dance where it delayed calling iput until the
next entry in the list was being considered to ensure the inode remained on
the sysctl_inodes list until the next entry was walked to. The structure
of the loop in this patch does not need that so is much easier to
understand and maintain.
Cc: stable@vger.kernel.org Reported-by: Andrei Vagin <avagin@gmail.com> Tested-by: Andrei Vagin <avagin@openvz.org> Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.") Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit 2fd1d2c4ceb2248a727696962cf3370dc9f5a0a4)
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric W. Biederman [Mon, 20 Feb 2017 05:17:03 +0000 (18:17 +1300)]
proc/sysctl: Don't grab i_lock under sysctl_lock.
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> This patch has locking problem. I've got lockdep splat under LTP.
>
> [ 6633.115456] ======================================================
> [ 6633.115502] [ INFO: possible circular locking dependency detected ]
> [ 6633.115553] 4.9.10-debug+ #9 Tainted: G L
> [ 6633.115584] -------------------------------------------------------
> [ 6633.115627] ksm02/284980 is trying to acquire lock:
> [ 6633.115659] (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
> [ 6633.115834] but task is already holding lock:
> [ 6633.115882] (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
> [ 6633.116026] which lock already depends on the new lock.
> [ 6633.116026]
> [ 6633.116080]
> [ 6633.116080] the existing dependency chain (in reverse order) is:
> [ 6633.116117]
> -> #2 (sysctl_lock){+.+...}:
> -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
> -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
>
> d_lock nests inside i_lock
> sysctl_lock nests inside d_lock in d_compare
>
> This patch adds i_lock nesting inside sysctl_lock.
Al Viro <viro@ZenIV.linux.org.uk> replied:
> Once ->unregistering is set, you can drop sysctl_lock just fine. So I'd
> try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
> drop sysctl_lock() before it and regain after. Make sure that no inodes
> are added to the list once ->unregistering has been set and use RCU list
> primitives for modifying the inode list, with sysctl_lock still used to
> serialize its modifications.
>
> Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
> igrab() is safe there. Since we don't drop inode reference until after we'd
> passed beyond it in the list, list_for_each_entry_rcu() should be fine.
I agree with Al Viro's analysis of the situation.
Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering") Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit ace0c791e6c3cf5ef37cad2df69f0d90ccc40ffb)
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
All of them have matching names, thus lookup has to scan through the whole
hash chain and call d_compare (proc_sys_compare) which checks them
under system-wide spinlock (sysctl_lock).
# time sysctl -a > /dev/null
real 1m12.806s
user 0m0.016s
sys 1m12.400s
Currently only memory reclaimer could remove this garbage.
But without significant memory pressure this never happens.
This patch collects sysctl inodes into list on sysctl table header and
prunes all their dentries once that table unregisters.
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> On 10.02.2017 10:47, Al Viro wrote:
>> how about the matching stats *after* that patch?
>
> dcache size doesn't grow endlessly, so stats are fine
>
> # sysctl fs.dentry-state
> fs.dentry-state = 92712 58376 45 0 0 0
>
> # time sysctl -a &>/dev/null
>
> real 0m0.013s
> user 0m0.004s
> sys 0m0.008s
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit d6cffbbe9a7e51eb705182965a189457c17ba8a3)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/proc/proc_sysctl.c Context has changed
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Problem:
The Linux kernel takes a logical volume offline after a LUN reset. This is
generally accompanied by this message in the dmesg output:
Device offlined - not ready after error recovery
Root Cause:
The root cause is a "quirk" in the timeout handling in the Linux SCSI
layer. The Linux kernel places a 30-second timeout on most media access
commands (reads and writes) that it sends to device drivers. When a media
access command times out, the Linux kernel goes into error recovery mode
for the LUN that was the target of the command that timed out. Every
command that timed out is kept on a list inside of the Linux kernel to be
retried later. The kernel attempts to recover the command(s) that timed out
by issuing a LUN reset followed by a TEST UNIT READY. If the LUN reset and
TEST UNIT READY commands are successful, the kernel retries the command(s)
that timed out.
Each SCSI command issued by the kernel has a result field associated with
it. This field indicates the final result of the command (success or
error). When a command times out, the kernel places a value in this result
field indicating that the command timed out.
The "quirk" is that after the LUN reset and TEST UNIT READY commands are
completed, the kernel checks each command on the timed-out command list
before retrying it. If the result field is still "timed out", the kernel
treats that command as not having been successfully recovered for a
retry. If the number of commands that are in this state is greater than
two, the kernel takes the LUN offline.
Fix:
When our RAIDStack receives a LUN reset, it simply waits until all
outstanding commands complete. Generally, all of these outstanding commands
complete successfully. Therefore, the fix in the smartpqi driver is to
always set the command result field to indicate success when a request
completes successfully. This normally isn’t necessary because the result
field is always initialized to success when the command is submitted to the
driver. So when the command completes successfully, the result field is
left untouched. But in this case, the kernel changes the result field
behind the driver’s back and then expects the field to be changed by the
driver as the commands that timed-out complete.
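The difference between the old and the fixed completion path can be shown with a small model. Everything here is hypothetical (the result values, struct cmd and the function names); it only illustrates why a stale "timed out" value survives when the driver leaves the field untouched on success.

```c
#include <assert.h>

/* Hypothetical result codes for this sketch. */
enum { RESULT_OK = 0, RESULT_TIMED_OUT = 1 };

/* Hypothetical command with the result field described above. */
struct cmd {
    int result;
};

/* What the SCSI layer does behind the driver's back on a timeout. */
static void midlayer_mark_timed_out(struct cmd *c)
{
    assert(c);
    c->result = RESULT_TIMED_OUT;
}

/* Old behavior: only write the field when the command failed. */
static void complete_old(struct cmd *c, int status)
{
    assert(c);
    if (status != RESULT_OK)
        c->result = status;
}

/* Fixed behavior: always write the field, so success overwrites
 * whatever value was placed there in the meantime. */
static void complete_fixed(struct cmd *c, int status)
{
    assert(c);
    c->result = status;
}
```

With the old path a command that completes successfully after a timeout still reads as timed out; with the fixed path it reads as successful and is eligible for retry.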
Reviewed-by: Dave Carroll <david.carroll@microsemi.com> Reviewed-by: Scott Teel <scott.teel@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
Orabug: 29848621
(cherry picked from commit ceecd0c696456068c2a0e1f0d3c32e5290d115ab) Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This changes the fork(2) syscall to record the process start_time after
initializing the basic task structure but still before making the new
process visible to user-space.
Technically, we could record the start_time anytime during fork(2). But
this might lead to scenarios where a start_time is recorded long before
a process becomes visible to user-space. For instance, with
userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
for an indefinite amount of time (and will, if this causes network
access, or similar).
By recording the start_time late, it much more closely reflects the point in
time where the process becomes live and can be observed by other
processes.
Lastly, this makes it much harder for user-space to predict and control
the start_time they get assigned. Previously, user-space could fork a
process and stall it in copy_thread_tls() before its pid is allocated,
but after its start_time is recorded. This can be misused to later-on
cycle through PIDs and resume the stalled fork(2) yielding a process
that has the same pid and start_time as a process that existed before.
This can be used to circumvent security systems that identify processes
by their pid+start_time combination.
Even though user-space was always aware that start_time recording is
flaky (but several projects are known to still rely on start_time-based
identification), changing the start_time to be recorded late will help
mitigate existing attacks and make it much harder for user-space to
control the start_time a process gets assigned.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Tom Gundersen <teg@jklm.no> Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3f2e4e1d9a6cffa95d31b7a491243d5e92a82507)
Signed-off-by: John Donnelly <john.p.donnelly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
mm: avoid taking zone lock in pagetypeinfo_showmixed()
pagetypeinfo_showmixedcount_print is found to take a lot of time to
complete and it does this holding the zone lock and disabling
interrupts. In some cases it is found to take more than a second (on a
2.4GHz, 8Gb RAM, arm64 cpu).
Avoid taking the zone lock similar to what is done by read_page_owner,
which means the results may be inaccurate.
Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 29905302
(cherry picked from commit 727c080f03e7e2e20e868efd461d4f1022b61d9b) Reviewed-by: Joe Jin <joe.jin@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Conflicts:
mm/page_owner.c
mm/vmstat.c
Ankur Arora [Wed, 12 Jun 2019 23:01:17 +0000 (19:01 -0400)]
x86/retpoline/ia32entry: Convert to non-speculative calls
Convert indirect jumps in 32-bit compat entry assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.
The ia32entry code does not care about the length of the CALL_NOSPEC
fragment, so unlike similar indirect callsites in entry_64.S we use
CALL_NOSPEC everywhere.
Cong Wang [Fri, 13 Oct 2017 18:58:53 +0000 (11:58 -0700)]
tun: call dev_get_valid_name() before register_netdevice()
register_netdevice() could fail early when we have an invalid
dev name, in which case ->ndo_uninit() is not called. For tun
device, this is a problem because a timer etc. are already
initialized and it expects ->ndo_uninit() to clean them up.
We could move these initializations into a ->ndo_init() so
that register_netdevice() knows better, however this is still
complicated due to the logic in tun_detach().
Therefore, I choose to just call dev_get_valid_name() before
register_netdevice(), which is quicker and much easier to audit.
And for this specific case, it is already enough.
Fixes: 96442e42429e ("tuntap: choose the txq based on rxq") Reported-by: Dmitry Alexeev <avekceeb@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0ad646c81b2182f7fa67ec0c8c825e0ee165696d)
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
chenjie [Thu, 30 Nov 2017 00:10:54 +0000 (16:10 -0800)]
mm/madvise.c: fix madvise() infinite loop under special circumstances
MADV_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
Unfortunately madvise_willneed() doesn't communicate this information
properly to the generic madvise syscall implementation. The calling
convention is quite subtle there. madvise_vma() is supposed to either
return an error or update &prev, otherwise the main loop will never
advance to the next vma and it will keep looping forever without a way
to get out of the kernel.
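The calling convention can be illustrated with a simplified userspace model of the walk. This is a sketch under stated assumptions, not the kernel's madvise() loop: struct vma, find_vma(), willneed() and walk() are hypothetical reductions, the 'fixed' flag selects between the broken and the repaired handler, and 'max_calls' is an artificial cap so the stuck case terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced vma: a half-open range in a list. */
struct vma {
    unsigned long start, end;
    struct vma *next;
    bool is_dax;
};

/* First vma whose end lies above addr, or NULL. */
static struct vma *find_vma(struct vma *head, unsigned long addr)
{
    for (; head; head = head->next)
        if (addr < head->end)
            return head;
    return NULL;
}

/* Handler contract: return an error, or store the handled vma in *prev. */
static int willneed(struct vma *vma, struct vma **prev, bool fixed)
{
    if (vma->is_dax) {
        if (fixed)
            *prev = vma; /* the fix: update *prev even for the no-op */
        return 0;
    }
    *prev = vma;
    return 0;
}

/* Returns the number of handler calls, or -1 if the walk got stuck. */
static int walk(struct vma *head, unsigned long end, bool fixed,
                int max_calls)
{
    struct vma *vma = head, *prev = NULL;
    unsigned long start;
    int calls = 0;

    assert(head);
    start = head->start;
    while (vma && start < end) {
        unsigned long tmp = vma->end < end ? vma->end : end;

        if (calls++ >= max_calls)
            return -1;
        if (willneed(vma, &prev, fixed))
            break;
        start = tmp;
        if (prev && start < prev->end)
            start = prev->end;
        /* Advancing relies on prev being the vma just handled. */
        vma = prev ? prev->next : find_vma(head, start);
    }
    return calls;
}
```

With three vmas where the middle one is DAX, the fixed handler finishes in three calls, while the broken one keeps revisiting the DAX vma because prev still points at its predecessor.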
It seems this has been broken since introduction. Nobody has noticed
because nobody seems to be using MADV_WILLNEED on these DAX mappings.
[mhocko@suse.com: rewrite changelog] Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place") Signed-off-by: chenjie <chenjie6@huawei.com> Signed-off-by: guoxuenan <guoxuenan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: zhangyi (F) <yi.zhang@huawei.com> Cc: Miao Xie <miaoxie@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Shaohua Li <shli@fb.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6ea8d958a2c95a1d514015d4e29ba21a8c0a1a91)
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Venkat Venkatsubra [Wed, 19 Jun 2019 15:15:38 +0000 (08:15 -0700)]
vxlan: fix use-after-free on deletion (part 2)
After commit 15a48cc22bf03434 (vxlan: fix use-after-free on deletion)
vxlan_dellink doesn't need to take out the vxlan_dev from the hlist,
because that is now done in vxlan_vs_del_dev.
Having it at both the places results in system crash sometimes with
the following stack trace
Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 57d88182ea3e8763111882671fd7462289272f64) Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
pravin shelar [Fri, 28 Oct 2016 16:59:15 +0000 (09:59 -0700)]
vxlan: avoid using stale vxlan socket.
When vxlan device is closed vxlan socket is freed. This
operation can race with vxlan-xmit function which
dereferences vxlan socket. Following patch uses RCU
mechanism to avoid this situation.
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c6fcc4fc5f8b592600c7409e769ab68da0fb1eca) Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
Mihai Carabas [Wed, 3 Apr 2019 21:50:13 +0000 (23:50 +0200)]
x86/microcode: add SPEC_CTRL_SSBD to x86_spec_ctrl_mask on late loading.
This is required so that we don't filter out the SPEC_CTRL_SSBD bit from
what the guest is trying to write to the MSR_IA32_SPEC_CTRL in
x86_virt_spec_ctrl(). Failure to do so would make it look like the guest
correctly enabled SSBD when it did not, as reading back the MSR from the
guest would not show the bit was filtered out, giving a false sense of
security.
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Alan Jenkins [Thu, 12 Apr 2018 18:11:58 +0000 (19:11 +0100)]
block: do not use interruptible wait anywhere
When blk_queue_enter() waits for a queue to unfreeze, or unset the
PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.
The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
("block, scsi: Make SCSI quiesce and resume work reliably"). Note the SCSI
device is resumed asynchronously, i.e. after un-freezing userspace tasks.
So that commit exposed the bug as a regression in v4.15. A mysterious
SIGBUS (or -EIO) sometimes happened during the time the device was being
resumed. Most frequently, there was no kernel log message, and we saw Xorg
or Xwayland killed by SIGBUS.[1]
[1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979
Without this fix, I get an IO error in this test:
while killall -SIGUSR1 dd; do sleep 0.1; done & \
echo mem > /sys/power/state ; \
sleep 5; killall dd # stop after 5 seconds
The interruptible wait was added to blk_queue_enter in
commit 3ef28e83ab15 ("block: generic request_queue reference counting").
Before then, the interruptible wait was only in blk-mq, but I don't think
it could ever have been correct.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: stable@vger.kernel.org Signed-off-by: Alan Jenkins <alan.christopher.jenkins@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428)
Mark Bloch [Fri, 2 Jun 2017 00:24:08 +0000 (03:24 +0300)]
vxlan: fix use-after-free on deletion
Adding a vxlan interface to a socket isn't symmetrical: while adding
is done in vxlan_open(), the deletion is done in vxlan_dellink().
This can cause a use-after-free error when we close the vxlan
interface before deleting it.
We add vxlan_vs_del_dev() to match vxlan_vs_add_dev() and call
it from vxlan_stop() to match the call from vxlan_open().
Fixes: 56ef9c909b40 ("vxlan: Move socket initialization to within rtnl scope") Acked-by: Jiri Benc <jbenc@redhat.com> Tested-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Mark Bloch <markb@mellanox.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a53cb29b0af346af44e4abf13d7e59f807fba690)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
vxlan: reduce usage of synchronize_net in ndo_stop
We only need to do the synchronize_net dance once for both, ipv4 and
ipv6 sockets, thus removing one synchronize_net in case both sockets get
dismantled.
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 544a773a01828e3cc3b553721f68d880d0d27a97)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
vxlan: synchronously and race-free destruction of vxlan sockets
Due to the fact that the udp socket is destructed asynchronously in a
work queue, we have some nondeterministic behavior during shutdown of
vxlan tunnels and creating new ones. Fix this by keeping the destruction
process synchronous with regard to the user space process so IFF_UP can
be reliably set.
udp_tunnel_sock_release destroys vs->sock->sk if reference counter
indicates so. We expect to have the same lifetime of vxlan_sock and
vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
only destruct the whole socket after we can be sure it cannot be found
by searching vxlan_net->sock_list.
Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jiri Benc <jbenc@redhat.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0412bd931f5f94d1054e958415c4a945d8ee62f4)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
vxlan: support both IPv4 and IPv6 sockets in a single vxlan device
For metadata based vxlan interface, open both IPv4 and IPv6 socket. This is
much more user friendly: it's not necessary to create two vxlan interfaces
and pay attention to using the right one in routing rules.
Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b1be00a6c39fda2ec380e168d7bcf96fb8c9da42)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 205f356d165033443793a97a668a203a79a8723a)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Venkat Venkatsubra [Mon, 10 Jun 2019 03:24:50 +0000 (20:24 -0700)]
openvswitch: Re-add CONFIG_OPENVSWITCH_VXLAN
This re-adds the config option CONFIG_OPENVSWITCH_VXLAN to avoid a
hard dependency of OVS on VXLAN. It moves the VXLAN config compat
code to vport-vxlan.c and allows compilation as a module.
Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device") Fixes: 2661371ace96 ("openvswitch: fix compilation when vxlan is a module") Cc: Pravin B Shelar <pshelar@nicira.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit dcc38c033b32b81b88b798f0c0b8453839ac996b)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/openvswitch/vport-netdev.c
Venkat Venkatsubra [Mon, 10 Jun 2019 01:53:13 +0000 (18:53 -0700)]
openvswitch: Use regular VXLAN net_device device
This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.
Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 614732eaa12dd462c0ab274700bed14f36afea5e)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c9db965c524ea27451e60d5ddcd242f6c33a70fd)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Thomas Graf [Tue, 21 Jul 2015 08:44:04 +0000 (10:44 +0200)]
openvswitch: Move dev pointer into vport itself
This is the first step in representing all OVS vports as regular
struct net_devices. Move the net_device pointer into the vport
structure itself to get rid of struct vport_netdev.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit be4ace6e6b1bc12e18b25fe764917e09a1f96d7b)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Thomas Graf [Tue, 21 Jul 2015 08:43:54 +0000 (10:43 +0200)]
ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic
Rename the tunnel metadata data structures currently internal to
OVS and make them generic for use by all IP tunnels.
Both structures are kernel internal and will stay that way. Their
members are exposed to user space through individual Netlink
attributes by OVS. It will therefore be possible to extend/modify
these structures without affecting user ABI.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1d8fff907342d2339796dbd27ea47d0e76a6a2d0)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0dfbdf4102b9303d3ddf2177c0220098ff99f6de)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c
Isaac Chen [Wed, 15 May 2019 01:09:18 +0000 (18:09 -0700)]
kexec: generate VMCOREINFO for module symbols
This commit is one of the three commits for generating VMCOREINFO
symbol information. See comments in previous two commits.
To dump kallsyms, module symbols must also be included. The
symbol table of each module can be located through the module
list. This commit generates necessary symbol information for
accessing the symbol tables of loaded modules.
Orabug: 29770217 Signed-off-by: Isaac Chen <isaac.chen@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Isaac Chen [Wed, 15 May 2019 00:06:26 +0000 (17:06 -0700)]
kexec: generate VMCOREINFO for tasks and pid
This commit is the continuation of the previous commit.
For more information see the previous commit comments.
This commit changes the VMCOREINFO buffer size from 4k to 8k, and
generates more symbol information for tasks and pid.
Orabug: 29770217 Signed-off-by: Isaac Chen <isaac.chen@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Isaac Chen [Tue, 14 May 2019 22:29:15 +0000 (15:29 -0700)]
kexec: generate VMCOREINFO for trace dump
The goal for this commit (and the next two) is to enable tools
to dump the trace buffer from /proc/vmcore at kdump time, without
requiring the kernel debug info package to be installed.
makedumpfile processes the VMCOREINFO section to dump the dmesg buffer.
Extending VMCOREINFO to facilitate dumping the trace buffer can be
very helpful in diagnosing system crashes.
This is part one of three commits to generate symbol information
into VMCOREINFO section.
Orabug: 29770217 Signed-off-by: Isaac Chen <isaac.chen@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Joao Martins [Mon, 10 Jun 2019 22:12:38 +0000 (23:12 +0100)]
tcp: fix fack_count accounting on tcp_shift_skb_data()
Since v4.15, i.e. since commit 737ff314563 ("tcp: use sequence distance to
detect reordering"), the stack has switched from packet-based FACK tracking
to sequence-based tracking.
v4.14 and older still have the old logic, so tcp_shift_skb_data() needs
to retain its original logic and keep @fack_count in sync. In other words,
we keep incrementing pcount with tcp_skb_pcount(skb) so that it can later
be used to update fack_count. To make this more explicit, we track the
amount added to pcount for the next skb in @next_pcount, which also
avoids calling tcp_skb_pcount(skb) repeatedly.
Orabug: 29890820 Fixes: 1b56e4cb5dec ("tcp: limit payload size of sacked skbs") Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com> Reviewed-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Dumazet [Sat, 8 Jun 2019 00:23:41 +0000 (17:23 -0700)]
tcp: add tcp_min_snd_mss sysctl
Some TCP peers announce a very small MSS option in their SYN and/or
SYN/ACK messages.
This forces the stack to send packets with a very high network/cpu
overhead.
Linux has enforced a minimal value of 48. Since this value includes
the size of TCP options, and the options can consume up to 40
bytes, this means that each segment can include only 8 bytes of payload.
In some cases, it can be useful to increase the minimal value
to a saner value.
We still leave the default at 48 (TCP_MIN_SND_MSS), for compatibility
reasons.
Note that TCP_MAXSEG socket option enforces a minimal value
of (TCP_MIN_MSS). David Miller increased this minimal value
in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.")
from 64 to 88.
We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
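The arithmetic above can be sketched in user space as follows. The helper names clamp_snd_mss() and worst_case_payload() are hypothetical and only model the idea of a configurable floor; they are not the kernel's implementation.

```c
/* Values taken from the commit message. */
#define TCP_MIN_SND_MSS      48  /* default floor */
#define MAX_TCP_OPTION_SPACE 40  /* TCP options can consume up to 40 bytes */

/* Hypothetical helper: never use an MSS below the configured floor. */
static int clamp_snd_mss(int peer_mss, int min_snd_mss)
{
    return peer_mss < min_snd_mss ? min_snd_mss : peer_mss;
}

/* Payload left in a segment when the options take their maximum space. */
static int worst_case_payload(int mss)
{
    return mss - MAX_TCP_OPTION_SPACE;
}
```

With the default floor, worst_case_payload(clamp_snd_mss(1, TCP_MIN_SND_MSS)) is 8 bytes per segment, which is why a larger floor can be useful.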
Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Jonathan Looney <jtl@netflix.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com>
Orabug: 29884306 Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[For backport we had to wrap sysctl_tcp_min_snd_mss from struct netns_ipv4
in the __GENKSYMS__ gunk. There is a nice 4 byte hole in it, so we fit it
within that structure. The kernel is the one responsible for creating the
structure so no danger of third-party drivers handing us a shorter structure.
Eric Dumazet [Fri, 7 Jun 2019 23:10:08 +0000 (16:10 -0700)]
tcp: tcp_fragment() should apply sane memory limits
Jonathan Looney reported that a malicious peer can force a sender
to fragment its retransmit queue into tiny skbs, inflating memory
usage and/or overflowing 32-bit counters.
TCP allows an application to queue up to sk_sndbuf bytes,
so we need to give some allowance for non malicious splitting
of retransmit queue.
A new SNMP counter is added to monitor how many times TCP
refused to split an skb because the allowance was exceeded.
Note that this counter might increase in the case applications
use SO_SNDBUF socket option to lower sk_sndbuf.
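A minimal sketch of such a limit, under stated assumptions: struct send_state and may_split_skb() are hypothetical names, and the allowance of twice the send buffer is an assumption of this sketch rather than something stated in the message above.

```c
#include <stdbool.h>

/* Hypothetical model of the per-socket state involved. */
struct send_state {
    long wmem_queued;            /* bytes currently queued for sending */
    long sndbuf;                 /* send buffer limit (cf. sk_sndbuf) */
    unsigned long split_refused; /* stands in for the new SNMP counter */
};

/*
 * Allow a split only while the queued memory stays within the assumed
 * allowance of twice the send buffer; otherwise count the refusal.
 */
static bool may_split_skb(struct send_state *s)
{
    if ((s->wmem_queued >> 1) > s->sndbuf) {
        s->split_refused++;
        return false;
    }
    return true;
}
```

Lowering sndbuf while data is already queued makes the check fail more easily, which matches the note about SO_SNDBUF.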
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com>
Orabug: 29884306 Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Eric Dumazet [Fri, 7 Jun 2019 22:51:09 +0000 (15:51 -0700)]
tcp: limit payload size of sacked skbs
Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :
BUG_ON(tcp_skb_pcount(skb) < pcount);
This can happen if the remote peer has advertised the smallest
MSS that Linux TCP accepts: 48
An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.
This means that the 16-bit width of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.
Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.
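The overflow can be checked with a few lines of arithmetic. This sketch assumes 8 bytes of payload per segment (an MSS of 48 minus 40 bytes of TCP options); the helper names are hypothetical.

```c
#include <stdint.h>

#define FRAGS_PER_SKB 17u  /* fragments per skb, from the commit message */

/* Segments needed to carry a completely filled skb. */
static uint32_t max_segments(uint32_t frag_bytes, uint32_t payload_per_seg)
{
    return (FRAGS_PER_SKB * frag_bytes) / payload_per_seg;
}

/* Does the segment count still fit a 16-bit counter? */
static int fits_u16(uint32_t v)
{
    return v <= UINT16_MAX;
}
```

With 32KB fragments this gives 17 * 32768 / 8 = 69632 segments, and with 64KB fragments 139264; both exceed the 16-bit maximum of 65535.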
Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Bruce Curtis <brucec@netflix.com>
tcp_collapse_retrans() is quite different in UEK4 and needs special
review. No change is needed in tcp_collapse_retrans() because
in UEK4 it is called only after checking with skb_availroom() for
available room and that the skb is linear. That is not the case in later
releases.
Arguments to tcp_shifted_skb() are different compared to the original patch,
but the difference is inconsequential to the issue being addressed.
In UEK4, TCP_SKB_CB(skb)->tcp_gso_segs is 32 bits.
So the original 16-bit overflow issue does not exist.
However, it is prudent to limit UEK4 as well.
Mike Kravetz [Tue, 28 May 2019 21:33:07 +0000 (14:33 -0700)]
hugetlbfs: don't retry when pool page allocations start to fail
When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
the pages will be interleaved between all nodes of the system. If
nodes are not equal, it is quite possible for one node to fill up
before the others. When this happens, the code still attempts to
allocate pages from the full node. This results in calls to direct
reclaim and compaction which slow things down considerably.
When allocating pool pages, note the state of the previous allocation
for each node. If previous allocation failed, do not use the
aggressive retry algorithm on successive attempts. The allocation
will still succeed if there is memory available, but it will not try
as hard to free up memory.
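The per-node bookkeeping described above could look roughly like this. It is a sketch only: struct pool_alloc_state and both helpers are hypothetical names, and a 64-bit mask (at most 64 nodes) is assumed for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state: bit n set means the last allocation on node n failed. */
struct pool_alloc_state {
    uint64_t noretry_nodes;
};

/* Use the aggressive retry algorithm only if the node did not fail last time. */
static bool use_aggressive_retry(const struct pool_alloc_state *s, unsigned int node)
{
    return !(s->noretry_nodes & (1ULL << node));
}

/* Record the outcome of an allocation attempt on the given node. */
static void note_alloc_result(struct pool_alloc_state *s, unsigned int node, bool ok)
{
    if (ok)
        s->noretry_nodes &= ~(1ULL << node);
    else
        s->noretry_nodes |= 1ULL << node;
}
```

A node that filled up is then tried without the expensive retries until an allocation on it succeeds again.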
In routine hugetlb_hstate_alloc_pages, do not allocate the bitmap for
gigantic pages as this routine is called before runtime memory
allocators are available. Also, note that the bitmap is not necessary
in the case of boot time allocation of gigantic pages.
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: John Donnelly <john.p.donnelly@oracle.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
bugs.c vs bugs_64.c in UEK4
- the "__init" attribute on the is_skylake_era() function doesn't need
to be removed -- it's not there in UEK4.
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/spec_ctrl.h
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/spec_ctrl.c
bugs.c vs bugs_64.c in UEK4
spec_ctrl.c code still in bugs_64.c on UEK4
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/include/asm/spec_ctrl.h
arch/x86/kernel/cpu/bugs.c
cpufeatures.h vs cpufeature.h in UEK4
include <linux/jump_label.h> header in spec_ctrl.h to use this feature
bugs.c vs bugs_64.c in UEK4
William Roche [Tue, 19 Feb 2019 14:11:12 +0000 (09:11 -0500)]
int3 handler better address space detection on interrupts
To prepare for dynamically changing interrupt handler code with
static_branch_enable/disable, and because the interrupt can occur
in either user space or kernel space, the int3 handler itself must
better identify whether the original interrupt came from the kernel
or from userland.
Signed-off-by: William Roche <william.roche@oracle.com> Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
(cherry picked from commit 594fc07cd96784004254680c9e1e4b757fb0a1f5)
Signed-off-by: William Roche <william.roche@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S
entry/entry_64.S vs kernel/entry_64.S in UEK4
Mark Nicholson [Tue, 7 May 2019 23:25:59 +0000 (16:25 -0700)]
repairing out-of-tree build functionality
The current uek4 tree (only) cannot build the binrpm-pkg with an objdir
outside the source tree. This fix redirects the incorrectly placed
generated firmware files and firmware parsers into the objtree.
Signed-off-by: Mark Nicholson <mark.j.nicholson@oracle.com> Reviewed-by: Tianyue Lan <tianyue.lan@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Shuning Zhang [Wed, 29 May 2019 07:41:35 +0000 (15:41 +0800)]
ext4: fix false negatives *and* false positives in ext4_check_descriptors()
ext4_check_descriptors() was getting called before s_gdb_count was
initialized. So for file systems w/o the meta_bg feature, allocation
bitmaps could overlap the block group descriptors and ext4 wouldn't
notice.
For file systems with the meta_bg feature enabled, there was a
fencepost error which would cause the ext4_check_descriptors() to
incorrectly believe that the block allocation bitmap overlaps with the
block group descriptor blocks, and it would reject the mount.
Fix both of these problems.
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
(cherry picked from commit 44de022c4382541cebdd6de4465d1f4f465ff1dd) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/ext4/super.c
[The context has changed]
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Shuning Zhang [Wed, 22 May 2019 01:32:40 +0000 (09:32 +0800)]
ocfs2: fix ocfs2 read inode data panic in ocfs2_iget
In some cases, ocfs2_iget() reads the data of an inode that has already
been deleted. That makes the system panic. So we should check whether
the inode has been deleted and tell the caller that the inode is a bad
inode.
For example, ocfs2 is used as the backend of NFS, and the client is
NFSv3. This issue can be reproduced by the following steps.
On the NFS server side,
..../patha/pathb
Step 1: Process A is scheduled before calling the function fh_verify.
Step 2: Process B is removing 'pathb' and has just completed the call
to the function dput. The dentry of 'pathb' has then been deleted from the
dcache, and all its ancestors have been deleted as well. The link between
the dentry and the inode was removed through the function hlist_del_init.
The call stack is as follows.
dentry_iput->hlist_del_init(&dentry->d_u.d_alias)
At this time, the inode is still in the dcache.
Step 3: Process A calls the function ocfs2_get_dentry, which gets the
inode from the dcache. The refcount of the inode is then 1. The call
stack is as follows.
nfsd3_proc_getacl->fh_verify->exportfs_decode_fh->fh_to_dentry(ocfs2_get_dentry)
Step 4: Dirty pages are flushed by the bdi threads. So the inode of 'patha'
is evicted, and this directory is deleted. But the inode of 'pathb'
can't be evicted, because its refcount is 1.
Step 5: Process A keeps running and calls the function
reconnect_path (in exportfs_decode_fh), which calls the ocfs2 function
ocfs2_get_parent. It gets the block number of the parent
directory (patha) by the name "..", then reads the data from disk by that
block number. But this inode has been deleted, so the system panics.
    Process A                             Process B
1.  in nfsd3_proc_getacl                |
2.                                      |  dput
3.  fh_to_dentry(ocfs2_get_dentry)      |
4.  bdi flush dirty cache               |
5.  ocfs2_iget                          |
[283465.542049] OCFS2: ERROR (device sdp): ocfs2_validate_inode_block:
Invalid dinode #580640: OCFS2_VALID_FL not set
[283465.545490] Kernel panic - not syncing: OCFS2: (device sdp): panic forced
after error
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com> Reviewed-by: John Donnelly <John.p.donnelly@Oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Marcel Holtmann [Fri, 18 Jan 2019 12:43:19 +0000 (13:43 +0100)]
Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer
The function l2cap_get_conf_opt will return L2CAP_CONF_OPT_SIZE + opt->len
as length value. The opt->len however is under the control of the remote user
and can be used by an attacker to gain access beyond the bounds of the
actual packet.
To prevent any potential leak of heap memory, it is enough to check that
the resulting len calculation after calling l2cap_get_conf_opt is not
below zero. A well formed packet will always return >= 0 here and will
end with the length value being zero after the last option has been
parsed. In case of malformed packets messing with the opt->len field the
length value will become negative. If that is the case, then just abort
and ignore the option.
In case an attacker uses a too short opt->len value, then garbage will
be parsed, but that is protected by the unknown option handling and also
the option parameter size checks.
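The length check described above can be sketched as a small type/length/value walker. This is a user-space model, not the kernel's l2cap_get_conf_opt(); count_conf_opts() is a hypothetical name, and the 2-byte header (one type byte, one length byte) is what this sketch assumes for L2CAP_CONF_OPT_SIZE.

```c
#include <stdint.h>

#define CONF_OPT_SIZE 2  /* assumed header: one type byte + one length byte */

/*
 * Count the well-formed options in a buffer. The remaining length is
 * reduced by header + opt->len; if it goes negative, the option claims
 * more data than the packet holds, so parsing stops and it is ignored.
 */
static int count_conf_opts(const uint8_t *data, int len)
{
    int n = 0;

    while (len >= CONF_OPT_SIZE) {
        uint8_t olen = data[1];  /* attacker-controlled length field */

        len -= CONF_OPT_SIZE + olen;
        if (len < 0)
            break;
        data += CONF_OPT_SIZE + olen;
        n++;
    }
    return n;
}
```

For a well-formed packet the remaining length ends at exactly zero after the last option, as the message says.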
Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Orabug: 29526426
CVE: CVE-2019-3459
(cherry picked from commit 7c9cbd0b5e38a1672fcd137894ace3b042dfbf69) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Marcel Holtmann [Fri, 18 Jan 2019 11:56:20 +0000 (12:56 +0100)]
Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt
When doing option parsing for standard type values of 1, 2 or 4 octets,
the value is converted directly into a variable instead of a pointer. To
avoid the value being tricked into being treated as a pointer, check for
these option types that the sizes actually match. In L2CAP every option is
fixed size and thus it is prudent anyway to ensure that the remote side
sends us the right option size along with option parameters.
If the option size is not matching the option type, then that option is
silently ignored. It is a protocol violation and instead of trying to
give the remote attacker any further hints just pretend that option is
not present and proceed with the default values. Implementations
following the specification and its qualification procedures will always
use the correct size and thus are not impacted here.
To keep the code readable and consistent across all options, a few
cosmetic changes were also required.
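A minimal sketch of the "ignore on size mismatch" rule, using one illustrative 2-octet option. parse_mtu_opt() and OPT_MTU_SIZE are hypothetical names, and the little-endian decoding is an assumption of this sketch.

```c
#include <stdint.h>

#define OPT_MTU_SIZE 2  /* assumed fixed size of the illustrative option */

/*
 * Decode a 2-octet option value. If the advertised length does not match
 * the fixed size, the option is treated as absent and the default is kept.
 */
static uint16_t parse_mtu_opt(const uint8_t *val, uint8_t olen, uint16_t dflt)
{
    if (olen != OPT_MTU_SIZE)
        return dflt;
    return (uint16_t)(val[0] | (val[1] << 8));
}
```

The caller never learns why the option was dropped, which mirrors the choice not to give a remote attacker further hints.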
Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Orabug: 29526426
CVE: CVE-2019-3459
(cherry picked from commit af3d5d1c87664a4f150fcf3534c6567cb19909b0) Signed-off-by: Dan Duval <dan.duval@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The ring buffer implementation in hid_debug_event() and hid_debug_events_read()
is strange, allowing lost or corrupted data. After commit 717adfdaf147
("HID: debug: check length before copy_to_user()") it is possible to enter
an infinite loop in hid_debug_events_read() by providing 0 as count, which
locks up the system. Fix this by rewriting the ring buffer implementation
with kfifo and simplifying the code.
This fixes CVE-2019-3819.
v2: fix an execution logic and add a comment
v3: use __set_current_state() instead of set_current_state()
Backport to v4.4: some (tree-wide) patches are missing in v4.4 so
cherry-pick relevant pieces from:
* 6396bb22151 ("treewide: kzalloc() -> kcalloc()")
* a9a08845e9ac ("vfs: do bulk POLL* -> EPOLL* replacement")
* 92529623d242 ("HID: debug: improve hid_debug_event()")
* 174cd4b1e5fb ("sched/headers: Prepare to move signal wakeup & sigpending
methods from <linux/sched.h> into <linux/sched/signal.h>")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187 Cc: stable@vger.kernel.org # v4.18+ Fixes: cd667ce24796 ("HID: use debugfs for events/reports dumping") Fixes: 717adfdaf147 ("HID: debug: check length before copy_to_user()") Signed-off-by: Vladis Dronov <vdronov@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b661fff5f8a0f19824df91cc3905ba2c5b54dc87)
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
This change has the following effects, in order of decreasing importance:
1) Prevent a stack buffer overflow
2) Do not append an unnecessary NUL terminator to a buffer that is binary
anyway, which writes one byte past client_digest when the caller is:
chap_string_to_hex(client_digest, chap_r, strlen(chap_r));
The latter was found by KASAN (see below) when the input value has the
expected size (32 hex chars), and further analysis revealed that a stack
buffer overflow can happen when the network-received value is longer,
allowing an unauthenticated remote attacker to smash up to 17 bytes after
the destination buffer (16 bytes attacker-controlled and one null). As
switching to hex2bin requires specifying the destination buffer length, and
does not internally append any null, it solves both issues.
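The effect of a length-bounded conversion can be sketched in user space. hex_to_bin_checked() is a hypothetical stand-in written for this illustration, not the kernel's hex2bin(); it takes the destination size, rejects input of any other length, and never appends a terminator.

```c
#include <stddef.h>
#include <string.h>

/* Map one hex character to its value, or -1 if it is not a hex digit. */
static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Decode exactly 2 * dst_len hex characters into dst.
 * Returns 0 on success, -1 on wrong length or a non-hex character.
 * Writes at most dst_len bytes and appends nothing after them.
 */
static int hex_to_bin_checked(unsigned char *dst, size_t dst_len, const char *hex)
{
    size_t i;

    if (strlen(hex) != 2 * dst_len)
        return -1;
    for (i = 0; i < dst_len; i++) {
        int hi = hex_nibble(hex[2 * i]);
        int lo = hex_nibble(hex[2 * i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        dst[i] = (unsigned char)((hi << 4) | lo);
    }
    return 0;
}
```

A 16-byte digest therefore accepts exactly 32 hex characters; a longer network-supplied string is rejected before any byte is written.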
This addresses CVE-2018-14633.
Beyond this:
- Validate received value length and check hex2bin accepted the input, to log
this rejection reason instead of just failing authentication.
- Only log received CHAP_R and CHAP_C values once they passed sanity checks.
==================================================================
BUG: KASAN: stack-out-of-bounds in chap_string_to_hex+0x32/0x60 [iscsi_target_mod]
Write of size 1 at addr ffff8801090ef7c8 by task kworker/0:0/1021
Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 755e45f3155cc51e37dc1cce9ccde10b84df7d93)
Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jason Yan [Tue, 25 Sep 2018 02:56:54 +0000 (10:56 +0800)]
scsi: libsas: fix a race condition when smp task timeout
When the LLDD is processing the completed sas task in interrupt context and
sets the task state to SAS_TASK_STATE_DONE, the smp timeout timer can be
triggered at the same time. smp_task_timedout() will then complete the task
whether SAS_TASK_STATE_DONE is set or not. The sas task may then be freed
before the LLDD ends its interrupt processing, and a use-after-free will happen.
Fix this by calling complete() only when SAS_TASK_STATE_DONE is not
set, and remove the check of the return value of del_timer(). Once the
LLDD sets DONE, it must call task->done(), which will call
smp_task_done()->complete() and the task will be completed and freed
correctly.
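The ordering rule can be modelled in a few lines. This is a single-threaded sketch with hypothetical names (struct task_model, on_timeout(), on_done()) and made-up flag values; real code has to serialize the flag check and update under a lock, which is omitted here.

```c
/* Illustrative flag values; the kernel's own definitions are not copied here. */
#define TASK_STATE_DONE    0x1u
#define TASK_STATE_ABORTED 0x2u

/* Hypothetical model of a task: state flags plus a completion count. */
struct task_model {
    unsigned int flags;
    int completions;
};

/* Timeout path: complete the task only if the driver has not set DONE. */
static void on_timeout(struct task_model *t)
{
    if (!(t->flags & TASK_STATE_DONE)) {
        t->flags |= TASK_STATE_ABORTED;
        t->completions++;
    }
}

/* Driver path: mark DONE, then complete through the done callback. */
static void on_done(struct task_model *t)
{
    t->flags |= TASK_STATE_DONE;
    t->completions++;
}
```

If the driver has already set DONE, the timeout leaves completion to the driver path, so the task is completed once and is not freed underneath the interrupt handler.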
Reported-by: chenxiang <chenxiang66@hisilicon.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> CC: John Garry <john.garry@huawei.com> CC: Johannes Thumshirn <jthumshirn@suse.de> CC: Ewan Milne <emilne@redhat.com> CC: Christoph Hellwig <hch@lst.de> CC: Tomas Henzl <thenzl@redhat.com> CC: Dan Williams <dan.j.williams@intel.com> CC: Hannes Reinecke <hare@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: John Garry <john.garry@huawei.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b90cd6f2b905905fb42671009dc0e27c310a16ae)
Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>