www.infradead.org Git - users/jedix/linux-maple.git/log

KVM: coalesced_mmio: add bounds checking

The first/last indexes are typically shared with a user app.
The app can change the 'last' index that the kernel uses
to store the next result. This change sanity checks the index
before using it for writing to a potentially arbitrary address.

Signed-off-by: Matt Delco <delco@chromium.org>
Orabug: 30318042
CVE: CVE-2019-14821
[setje: This patch came to UEK while still under embargo.]
Signed-off-by: Jan Setje-Eilers <jan.setjeeilers@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

qla2xxx: Fix List corruption due to Get Name List

gnl command was not serialized, which resulted into
double addition of same object into list. This resulted
into list add corruption.

Orabug: 29894072

Signed-off-by: Quinn Tran <qutran@marvell.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

qla2xxx: Update driver version 9.00.00.00.42.0-k1

Orabug: 29894072

Signed-off-by: Himanshu Madhani <hmadhani@marvell.com>
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen/swiotlb: remember having called xen_create_contiguous_region()

Instead of always calling xen_destroy_contiguous_region() in case the
memory is DMA-able for the used device, do so only in case it has been
made DMA-able via xen_create_contiguous_region() before.

This will avoid a lot of xen_destroy_contiguous_region() calls for
64-bit capable devices.

As the memory in question is owned by swiotlb-xen the PG_owner_priv_1
flag of the first allocated page can be used for remembering.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit b877ac9815a8fe7e5f6d7fdde3dc34652408840a)
Orabug: 30141778
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
PF_NO_COMPOUND is not used for PAGEFLAG() in uek4

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen/swiotlb: simplify range_straddles_page_boundary()

range_straddles_page_boundary() is open coding several macros from
include/xen/page.h. Use those instead. Additionally there is no need
to have check_pages_physically_contiguous() as a separate function as
it is used only once, so merge it into range_straddles_page_boundary().

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit bf70726668c6116aa4976e0cc87f470be6268a2f)
Orabug: 30141778
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen/swiotlb: fix condition for calling xen_destroy_contiguous_region()

The condition in xen_swiotlb_free_coherent() for deciding whether to
call xen_destroy_contiguous_region() is wrong: in case the region to
be freed is not contiguous calling xen_destroy_contiguous_region() is
the wrong thing to do: it would result in inconsistent mappings of
multiple PFNs to the same MFN. This will lead to various strange
crashes or data corruption.

Instead of calling xen_destroy_contiguous_region() in that case a
warning should be issued as that situation should never occur.

Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 50f6393f9654c561df4cdcf8e6cfba7260143601)
Orabug: 30141778
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: purge write queue in tcp_connect_init()

[ Upstream commit 7f582b248d0a86bae5788c548d7bb5bca6f7691a ]

syzkaller found a reliable way to crash the host, hitting a BUG()
in __tcp_retransmit_skb()

Malicous MSG_FASTOPEN is the root cause. We need to purge write queue
in tcp_connect_init() at the point we init snd_una/write_seq.

This patch also replaces the BUG() by a less intrusive WARN_ON_ONCE()

kernel BUG at net/ipv4/tcp_output.c:2837!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 5276 Comm: syz-executor0 Not tainted 4.17.0-rc3+ #51
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__tcp_retransmit_skb+0x2992/0x2eb0 net/ipv4/tcp_output.c:2837
RSP: 0000:ffff8801dae06ff8 EFLAGS: 00010206
RAX: ffff8801b9fe61c0 RBX: 00000000ffc18a16 RCX: ffffffff864e1a49
RDX: 0000000000000100 RSI: ffffffff864e2e12 RDI: 0000000000000005
RBP: ffff8801dae073a0 R08: ffff8801b9fe61c0 R09: ffffed0039c40dd2
R10: ffffed0039c40dd2 R11: ffff8801ce206e93 R12: 00000000421eeaad
R13: ffff8801ce206d4e R14: ffff8801ce206cc0 R15: ffff8801cd4f4a80
FS:  0000000000000000(0000) GS:ffff8801dae00000(0063) knlGS:00000000096bc900
CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000020000000 CR3: 00000001c47b6000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
tcp_retransmit_skb+0x2e/0x250 net/ipv4/tcp_output.c:2923
tcp_retransmit_timer+0xc50/0x3060 net/ipv4/tcp_timer.c:488
tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573
tcp_write_timer+0x111/0x1d0 net/ipv4/tcp_timer.c:593
call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
expire_timers kernel/time/timer.c:1363 [inline]
__run_timers+0x79e/0xc50 kernel/time/timer.c:1666
run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
__do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
invoke_softirq kernel/softirq.c:365 [inline]
irq_exit+0x1d1/0x200 kernel/softirq.c:405
exiting_irq arch/x86/include/asm/apic.h:525 [inline]
smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863

Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5bbe138a250e795b32f29fc91af88c69851a12d3)

Orabug: 30240133
CVE: CVE-2019-15239

Signed-off-by: Larry Bassel <larry.bassel@oracle.com>
Reviewed-by: John Donnelly <john.p.donnelly@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

blk-mq: don't complete un-started request in timeout handler

When iterating busy requests in timeout handler,
if the STARTED flag of one request isn't set, that means
the request is being processed in block layer or driver, and
isn't submitted to hardware yet.

In current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This way is
wrong, and can cause rq corruption or double allocation[1][2],
when doing I/O and removing&resetting NVMe device at the sametime.

This patch fixes several issues reported by Yi Zhang.

[1]. oops log 1
[  581.789754] ------------[ cut here ]------------
[  581.789758] kernel BUG at block/blk-mq.c:374!
[  581.789760] invalid opcode: 0000 [#1] SMP
[  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[  581.789804] Workqueue: kblockd blk_mq_timeout_work
[  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[  581.789820] Call Trace:
[  581.789825]  blk_mq_check_expired+0x76/0x80
[  581.789828]  bt_iter+0x45/0x50
[  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789840]  ? __switch_to+0x140/0x450
[  581.789841]  blk_mq_timeout_work+0x88/0x170
[  581.789845]  process_one_work+0x165/0x410
[  581.789847]  worker_thread+0x137/0x4c0
[  581.789851]  kthread+0x101/0x140
[  581.789853]  ? rescuer_thread+0x3b0/0x3b0
[  581.789855]  ? kthread_park+0x90/0x90
[  581.789860]  ret_from_fork+0x2c/0x40
[  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

[2]. oops log2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451]  ? mempool_free+0x2b/0x80
[ 6984.857455]  ? bio_free+0x4e/0x60
[ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
[ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
[ 6984.857473]  process_one_work+0x165/0x410
[ 6984.857475]  worker_thread+0x137/0x4c0
[ 6984.857478]  kthread+0x101/0x140
[ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481]  ? kthread_park+0x90/0x90
[ 6984.857489]  ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

Cc: stable@vger.kernel.org
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 95a49603707d982b25d17c5b70e220a05556a2f9)

Orabug: 29903684
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Conflicts:
block/blk-mq.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: fix a stale ooo_last_skb after a replace

When skb replaces another one in ooo queue, I forgot to also
update tp->ooo_last_skb as well, if the replaced skb was the last one
in the queue.

To fix this, we simply can re-use the code that runs after an insertion,
trying to merge skbs at the right of current skb.

This not only fixes the bug, but also remove all small skbs that might
be a subset of the new one.

Example:

We receive segments 2001:3001,  4001:5001

Then we receive 2001:8001 : We should replace 2001:3001 with the big
skb, but also remove 4001:50001 from the queue to save space.

packetdrill test demonstrating the bug

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
+0.100 < . 1:1(0) ack 1 win 1024
+0 accept(3, ..., ...) = 4

+0.01 < . 1001:2001(1000) ack 1 win 1024
+0    > . 1:1(0) ack 1 <nop,nop, sack 1001:2001>

+0.01 < . 1001:3001(2000) ack 1 win 1024
+0    > . 1:1(0) ack 1 <nop,nop, sack 1001:2001 1001:3001>

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Yuchung Cheng <ycheng@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 76f0dcbb5ae1a7c3dbeec13dd98233b8e6b0b32a)

Orabug: 29997352
Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/ipv4/tcp_input.c
tcp_drop was introduced by d0f2a1c0e4

Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: keep kabi compatibility of may_expand_vm() etc

One of the previous patches:

mm: rework virtual memory accounting

has modifications on the prototype of functions like may_expand_vm() and the
shared_vm field of mm struct, which actually break the compatibility of kabi.

In order to keep kabi compatibility intact, this patch changed the related
function prototype back and renamed the data_vm back to shared_vm.

Orabug: 30145754
Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: always print RLIMIT_DATA warning

The documentation for ignore_rlimit_data says that it will print a
warning at first misuse. Yet it doesn't seem to do that.

Fix the code to print the warning even when we allow the process to
continue.

Link: http://lkml.kernel.org/r/1517935505-9321-1-git-send-email-dwmw@amazon.co.uk
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754
(cherry picked from commit 57a7702b12bc610393bf7764d25183392344ee92)
Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: enable RLIMIT_DATA by default with workaround for valgrind

Since commit 84638335900f ("mm: rework virtual memory accounting")
RLIMIT_DATA limits both brk() and private mmap() but this's disabled by
default because of incompatibility with older versions of valgrind.

Valgrind always set limit to zero and fails if RLIMIT_DATA is enabled.
Fortunately it changes only rlim_cur and keeps rlim_max for reverting
limit back when needed.

This patch checks current usage also against rlim_max if rlim_cur is
zero. This is safe because task anyway can increase rlim_cur up to
rlim_max. Size of brk is still checked against rlim_cur, so this part
is completely compatible - zero rlim_cur forbids brk() but allows
private mmap().

Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mmap.c

(cherry picked from commit f4fcd55841fc9e46daac553b39361572453c2b88)
Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: warn about VmData over RLIMIT_DATA

This patch provides a way of working around a slight regression
introduced by commit 84638335900f ("mm: rework virtual memory
accounting").

Before that commit RLIMIT_DATA have control only over size of the brk
region.  But that change have caused problems with all existing versions
of valgrind, because it set RLIMIT_DATA to zero.

This patch fixes rlimit check (limit actually in bytes, not pages) and
by default turns it into warning which prints at first VmData misuse:

  "mmap: top (795): VmData 516096 exceed data ulimit 512000.  Will be forbidden soon."

Behavior is controlled by boot param ignore_rlimit_data=y/n and by sysfs
/sys/module/kernel/parameters/ignore_rlimit_data.  For now it set to "y".

[akpm@linux-foundation.org: tweak kernel-parameters.txt text[
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kees Cook <keescook@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754

Fixed current pointer dereferencing error

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mmap.c

(cherry picked from commit d977d56ce5b3e8842236f2f9e7483d4914c9592e)
Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: rework virtual memory accounting

When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
testing the RLIMIT_DATA value to figure out if we're allowed to assign
new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
commited that RLIMIT_DATA in a form it's implemented now doesn't do
anything useful because most of user-space libraries use mmap() syscall
for dynamic memory allocations.

Linus suggested to convert RLIMIT_DATA rlimit into something suitable
for anonymous memory accounting.  But in this patch we go further, and
the changes are bundled together as:

* keep vma counting if CONFIG_PROC_FS=n, will be used for limits
* replace mm->shared_vm with better defined mm->data_vm
* account anonymous executable areas as executable
* account file-backed growsdown/up areas as stack
* drop struct file* argument from vm_stat_account
* enforce RLIMIT_DATA for size of data areas

This way code looks cleaner: now code/stack/data classification depends
only on vm_flags state:

VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)

The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
"shared", but that might be strange beast like readonly-private or VM_IO
area.

- RLIMIT_AS            limits whole address space "VmSize"
- RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
- RLIMIT_DATA          now limits "VmData"

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Kees Cook <keescook@google.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
        fs/proc/task_mmu.c
        mm/mprotect.c

(cherry picked from commit 84638335900f1995495838fe1bd4870c43ec1f67)
Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: add the "struct mm_struct *mm" local into

Cosmetic, but expand_upwards() and expand_downwards() overuse vma->vm_mm,
a local variable makes sense imho.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/mmap.c

(cherry picked from commit 09357814778a38a5ab2d031cba6c9e9fe090c849)
Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: fix the racy mm->locked_vm change in

"mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are
not safe; multiple threads using the same ->mm can do this at the same
time trying to expans different vma's under down_read(mmap_sem). This
means that one of the "locked_vm += grow" changes can be lost and we can
miss munlock_vma_pages_all() later.

Move this code into the caller(s) under mm->page_table_lock. All other
updates to ->locked_vm hold mmap_sem for writing.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754
(cherry picked from commit 87e8827b37c0c391d9915d0dc6a06c9b5f9cac65)
Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/mmap.c: remove redundant local variables for may_expand_vm()

Simplify may_expand_vm().

[akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 30145754
(cherry picked from commit 0b57d6ba0bd11a41a791cf6c5bbf3b55630dc70f)
Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: loop: fix another reread part failure

loop_clr_fd() can be run piggyback with lo_release(), and
under this situation, reread partition may always fail because
bd_mutex has been held already.

This patch detects the situation by the reference count, and
call __blkdev_reread_part() to avoid acquiring the lock again.

In the meantime, this patch switches to new kernel APIs
of blkdev_reread_part() and __blkdev_reread_part().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry-picked from commit 06f0e9e68c0d81c7d822a405f6e35686a711c1fe)
Orabug: 30264603
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: loop: don't hold lo_ctl_mutex in lo_open

The lo_ctl_mutex is held for running all ioctl handlers, and
in some ioctl handlers, ioctl_by_bdev(BLKRRPART) is called for
rereading partitions, which requires bd_mutex.

So it is easy to cause failure because trylock(bd_mutex) may
fail inside blkdev_reread_part(), and follows the lock context:

blkid or other application:
->open()
->mutex_lock(bd_mutex)
->lo_open()
->mutex_lock(lo_ctl_mutex)

losetup(set fd ioctl):
->mutex_lock(lo_ctl_mutex)
->ioctl_by_bdev(BLKRRPART)
->trylock(bd_mutex)

This patch trys to eliminate the ABBA lock dependency by removing
lo_ctl_mutext in lo_open() with the following approach:

1) make lo_refcnt as atomic_t and avoid acquiring lo_ctl_mutex in lo_open():
- for open vs. add/del loop, no any problem because of loop_index_mutex
- freeze request queue during clr_fd, so I/O can't come until
clearing fd is completed, like the effect of holding lo_ctl_mutex
in lo_open
- both open() and release() have been serialized by bd_mutex already

2) don't hold lo_ctl_mutex for decreasing/checking lo_refcnt in
lo_release(), then lo_ctl_mutex is only required for the last release.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry-picked from commit f8933667953e8e61bb6104f5ca88e32e85656a93)
Orabug: 30264603
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

dm bufio: fix deadlock with loop device

When thin-volume is built on loop device, if available memory is low,
the following deadlock can be triggered:

One process P1 allocates memory with GFP_FS flag, direct alloc fails,
memory reclaim invokes memory shrinker in dm_bufio, dm_bufio_shrink_scan()
runs, mutex dm_bufio_client->lock is acquired, then P1 waits for dm_buffer
IO to complete in __try_evict_buffer().

But this IO may never complete if issued to an underlying loop device
that forwards it using direct-IO, which allocates memory using
GFP_KERNEL (see: do_blockdev_direct_IO()). If allocation fails, memory
reclaim will invoke memory shrinker in dm_bufio, dm_bufio_shrink_scan()
will be invoked, and since the mutex is already held by P1 the loop
thread will hang, and IO will never complete. Resulting in ABBA
deadlock.

Cc: stable@vger.kernel.org
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
(cherry picked from commit bd293d071ffe65e645b4d8104f9d8fe15ea13862)

Orabug: 29964645
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

dm bufio: don't take the lock in dm_bufio_shrink_count

dm_bufio_shrink_count() is called from do_shrink_slab to find out how many
freeable objects are there. The reported value doesn't have to be precise,
so we don't need to take the dm-bufio lock.

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
(cherry picked from commit d12067f428c037b4575aaeb2be00847fc214c24a)

Orabug: 29964645
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: rds-info shows IPv4 address as '0.0.0.0'

When the rds-info command reads IP address from send/retransmit queue,
it gets IPv4 address as '0.0.0.0'. It is due to reading the address
from the wrong offset.

Orabug: 30022915

Signed-off-by: aru kolappan <aru.kolappan@oracle.com>
Reviewed-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

restore cond_resched() in shrink_dcache_parent()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 4fb48871409e2fcd375087d526d07f7600c88f94)

Orabug: 30101895

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/dcache.c - Upstream did not have 4th arg NULL to d_walk().

Signed-off-by: Brian Maly <brian.maly@oracle.com>

retpoline: Move retpoline_mode_selected() out of .init.text section

Remove the __init macro from the definition of retpoline_mode_selected(),
since there are functions that call it after initialization. So far
a problem has not occurred because calls to retpoline_mode_selected()
are inlined by the compiler, but this could change in the future.

Orabug: 30250332

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reported-by: Jamie Iles <jamie.iles@oracle.com>
Reviewed-by: Jamie Iles <jamie.iles@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-netback: use irqsave/irqrestore in xenvif_rx_dequeue()

xenvif_rx_action() acquires, releases queue->rx_lock via spin_lock_irq(),
spin_unlock_irq(). This in-turn calls xenvif_rx_dequeue(), which
acquires, releases a different spinlock, queue->rx_queue.lock, also via
spin_lock_irq(), spin_unlock_irq(). The second set of calls is
problematic because it leads to irqs being enabled early:

   spin_lock_irq(&queue->rx_lock);         /* xenvif_rx_action */
   spin_lock_irq(&queue->rx_queue.lock);   /* xenvif_rx_dequeue */
   spin_unlock_irq(&queue->rx_queue.lock); /* xenvif_rx_dequeue */
>>   /* irqs already enabled */
   spin_unlock_irq(&queue->rx_lock);       /* xenvif_rx_action */

In general spin_lock_irq(), spin_unlock_irq() calls cannot be composed.

Fixes: 60fb2abf8867 ("xen-netback: copy buffer on xenvif_start_xmit")
(cherry picked from commit 75a74bc16c1f378e035604dc0c730ef2ab061d09)

Orabug: 30223112
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

SUNRPC fix regression in umount of a secure mount

If call_status returns ENOTCONN, we need to re-establish the connection
state after. Otherwise the client goes into an infinite loop of call_encode,
call_transmit, call_status (ENOTCONN), call_encode.

Fixes: c8485e4d63 ("SUNRPC: Handle ECONNREFUSED correctly in xprt_transmit()")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # v2.6.29+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
(cherry picked from commit ec6017d9035986a36de064f48a63245930bfad6f)
Orabug: 29926734
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Reviewed-by: Bill Baker <bill.baker@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: fix RO partition with RW disk

When md raid1 was used with imsm metadata, during the boot stage,
the raid device will first be set to readonly, then mdmon will set
it read-write later. When there were some partitions in this device,
the following race would make some partition left ro and fail to mount.

CPU 1:                                                 CPU 2:
add_partition()                                        set_disk_ro() //set disk RW
//disk was RO, so partition set to RO
p->policy = get_disk_ro(disk);
                                                        if (disk->part0.policy != flag) {
                                                            set_disk_ro_uevent(disk, flag);
                                                            // disk set to RW
                                                            disk->part0.policy = flag;
                                                        }
                                                        // set all exit partition to RW
                                                        while ((part = disk_part_iter_next(&piter)))
                                                            part->policy = flag;
// this part was not yet added, so it was still RO
rcu_assign_pointer(ptbl->part[partno], p);

Move RO status setting of partitions after they were added into partition
table and introduce a mutex to sync RO status between disk and partitions.

Orabug: 30159928

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

retpoline: Show correct spectrev2 mitigation after loading non-retpoline module

If a loaded kernel module is not built with retpoline capabilities,
the kernel is tainted and sysfs reports the system as "Vulnerable"
to spectre v2, even though the retpoline mitigation is still enabled.

Change the message displayed in sysfs to report when a non-retpoline
module has been loaded using the new format:

Mitigation: Full generic retpoline (non-retpoline module(s) has been loaded), IBRS_FW, IBPB

This enables more precise tracking of the security status by
differentiating the cases where no spectre v2 mitigation is
available (Vulnerable), and when retpoline is available/active but
a vulnerable module has introduced a potential attack vector.

Orabug: 30185537

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tun: allow positive return values on dev_get_valid_name() call

If the name argument of dev_get_valid_name() contains "%d", it will try
to assign it a unit number in __dev__alloc_name() and return either the
unit number (>= 0) or an error code (< 0).
Considering positive values as error values prevent tun device creations
relying this mechanism, therefor we should only consider negative values
as errors here.

Signed-off-by: Julien Gomes <julien@arista.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 30085611

(cherry picked from commit 5c25f65fd1e42685f7ccd80e0621829c105785d9)
Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen: let alloc_xenballooned_pages() fail if not enough memory free

Instead of trying to allocate pages with GFP_USER in
add_ballooned_pages() check the available free memory via
si_mem_available(). GFP_USER is far less limiting memory exhaustion
than the test via si_mem_available().

This will avoid dom0 running out of memory due to excessive foreign
page mappings especially on ARM and on x86 in PVH mode, as those don't
have a pre-ballooned area which can be used for foreign mappings.

As the normal ballooning suffers from the same problem don't balloon
down more than si_mem_available() pages in one iteration. At the same
time limit the default maximum number of retries.

This is part of XSA-300.

Signed-off-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit a1078e821b605813b63bf6bca414a85f804d5c66)

Orabug: 30073695

CVE has not been assigned yet.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Patrick Colp <patrick.colp@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/page_alloc.c: calculate 'available' memory in a separate function

Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.

It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap.  This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.

This patch (of 2):

Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.

In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).

Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d02bd27bd33dd7e8d22594cd568b81be0cb584cd)

Orabug: 30073695

Need this to provide si_mem_available() for the next patch

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Patrick Colp <patrick.colp@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/proc/meminfo.c: we don't have commit 84ad5802a33a ("proc:
        meminfo: estimate available memory more conservatively") and
        even though it looks reasonable and simple I didn't want to
        include it because this would change a user-visible attribute.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

Input: gtco - bounds check collection indent level

The GTCO tablet input driver configures itself from an HID report sent
via USB during the initial enumeration process. Some debugging messages
are generated during the parsing. A debugging message indentation
counter is not bounds checked, leading to the ability for a specially
crafted HID report to cause '-' and null bytes be written past the end
of the indentation array. As long as the kernel has CONFIG_DYNAMIC_DEBUG
enabled, this code will not be optimized out. This was discovered
during code review after a previous syzkaller bug was found in this
driver.

Signed-off-by: Grant Hernandez <granthernandez@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
(cherry picked from commit 2a017fd82c5402b3c8df5e3d6e5165d9e6147dc1)

Orabug: 30074413
CVE: CVE-2019-13631

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Documentation/Docbook/Makefile: process xml files in parallel, based on nproc --all value

Documentation/Docbook/Makefile: process xml files in parallel, based on nproc --all value
This commits introduces a few simple changes, that keep execution of doc compilation serial
( docbook documentation Makefile does not support parallelisation ), but allow parallel
production of xml files. Also compression of man files is now happening in parallel as well.
Number of max threads is limited to nproc --all value.

Orabug: 30079814

Signed-off-by: Alex Burmashev <alexander.burmashev@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: rds: fix compile warning

The commit a5ee13ea4ae4 ("rds: IB: fix returned value not set error")
causes this problem. When compliling, the following warning will appear.
"
net/rds/ib.c: In function rds_ib_do_failover
net/rds/ib.c:1386: warning: unused variable ret
"

Orabug: 30118035

Fixes: a5ee13ea4ae4 ("rds: IB: fix returned value not set error")
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Reviewed-by: Junxiao Bi<junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Revert "xen-netback: stop netif TX queue on guest queuing failure"

Orabug: 30128203

This reverts commit 9a196f3f92c9506e954a7ca37537c8a8b8faadc1.

xen-netback: stop netif TX queue on guest queuing failure

On a failure to queue an TX skb to the guest's RX io_ring, we return
NETDEV_TX_BUSY without stopping the netif-queue. If the guest's io_ring
is full, this can lead to unbounded retry.

This fix stops the queue when hit this condition and restarts it once
the guest creates space in its RX io_ring. This also hardens against
malicious guests where the guest netfront can cause DoS on the host by
just not putting responses on the RX io_ring.

The guest can send an rx_interrupt even before the vif is fully
provisioned, so do this only after the vif moves to VIF_STATE_CONNECTED.

Fixes: fb9041f4f7f6 ("xen-netback: copy buffer on xenvif_start_xmit")
(cherry picked from commit c655018645c32745041e5a3d03cbfbf6f0d820ee)

Orabug: 30028200
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: megaraid_sas: fix panic on loading firmware crashdump

While loading fw crashdump in function fw_crash_buffer_show(), left bytes
in one dma chunk was not checked, if copying size over it, overflow access
will cause kernel panic.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Sumit Saxena <sumit.saxena@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 3b5f307ef3cb5022bfe3c8ca5b8f2114d5bf6c29)

Orabug: 29993112
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: John Donnelly <John.p.donnnelly@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Exclude ATOMs from speculation through SWAPGS

Intel provided the following information:

On all current Atom processors, instructions that use a segment register
value (e.g. a load or store) will not speculatively execute before the
last writer of that segment retires. Thus they will not use a
speculatively written segment value.

That means on ATOMs there is no speculation through SWAPGS, so the SWAPGS
entry paths can be excluded from the extra LFENCE if PTI is disabled.

Create a separate bug flag for the through SWAPGS speculation and mark all
out-of-order ATOMs and AMD/HYGON CPUs as not affected. The in-order ATOMs
are excluded from the whole mitigation mess anyway.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
(cherry picked from commit f36cf386e3fec258a341d446915862eded3e13d8)
Orabug: 29967571
CVE: CVE-2019-1125

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Enable Spectre v1 swapgs mitigations

The previous commit added macro calls in the entry code which mitigate the
Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are
enabled.  Enable those features where applicable.

The mitigations may be disabled with "nospectre_v1" or "mitigations=off".

There are different features which can affect the risk of attack:

- When FSGSBASE is enabled, unprivileged users are able to place any
  value in GS, using the wrgsbase instruction.  This means they can
  write a GS value which points to any value in kernel space, which can
  be useful with the following gadget in an interrupt/exception/NMI
  handler:

if (coming from user space)
swapgs
mov %gs:<percpu_offset>, %reg1
// dependent load or store based on the value of %reg
// for example: mov %(reg1), %reg2

  If an interrupt is coming from user space, and the entry code
  speculatively skips the swapgs (due to user branch mistraining), it
  may speculatively execute the GS-based load and a subsequent dependent
  load or store, exposing the kernel data to an L1 side channel leak.

  Note that, on Intel, a similar attack exists in the above gadget when
  coming from kernel space, if the swapgs gets speculatively executed to
  switch back to the user GS.  On AMD, this variant isn't possible
  because swapgs is serializing with respect to future GS-based
  accesses.

  NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case
doesn't exist quite yet.

- When FSGSBASE is disabled, the issue is mitigated somewhat because
  unprivileged users must use prctl(ARCH_SET_GS) to set GS, which
  restricts GS values to user space addresses only.  That means the
  gadget would need an additional step, since the target kernel address
  needs to be read from user space first.  Something like:

if (coming from user space)
swapgs
mov %gs:<percpu_offset>, %reg1
mov (%reg1), %reg2
// dependent load or store based on the value of %reg2
// for example: mov %(reg2), %reg3

  It's difficult to audit for this gadget in all the handlers, so while
  there are no known instances of it, it's entirely possible that it
  exists somewhere (or could be introduced in the future).  Without
  tooling to analyze all such code paths, consider it vulnerable.

  Effects of SMAP on the !FSGSBASE case:

  - If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not
    susceptible to Meltdown), the kernel is prevented from speculatively
    reading user space memory, even L1 cached values.  This effectively
    disables the !FSGSBASE attack vector.

  - If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP
    still prevents the kernel from speculatively reading user space
    memory.  But it does *not* prevent the kernel from reading the
    user value from L1, if it has already been cached.  This is probably
    only a small hurdle for an attacker to overcome.

Thanks to Dave Hansen for contributing the speculative_smap() function.

Thanks to Andrew Cooper for providing the inside scoop on whether swapgs
is serializing on AMD.

[ tglx: Fixed the USER fence decision and polished the comment as suggested
   by Dave Hansen ]

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
(cherry picked from commit a2059825986a1c8143fd6698774fa9d83733bb11)
Orabug: 29967571
CVE: CVE-2019-1125

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
kernel-parameters.txt The file location is different. Manual
edit to file.
bugs.c The changes are manually ported to
bugs_64.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations

Spectre v1 isn't only about array bounds checks.  It can affect any
conditional checks.  The kernel entry code interrupt, exception, and NMI
handlers all have conditional swapgs checks.  Those may be problematic in
the context of Spectre v1, as kernel code can speculatively run with a user
GS.

For example:

if (coming from user space)
swapgs
mov %gs:<percpu_offset>, %reg
mov (%reg), %reg1

When coming from user space, the CPU can speculatively skip the swapgs, and
then do a speculative percpu load using the user GS value.  So the user can
speculatively force a read of any kernel value.  If a gadget exists which
uses the percpu value as an address in another load/store, then the
contents of the kernel value may become visible via an L1 side channel
attack.

A similar attack exists when coming from kernel space.  The CPU can
speculatively do the swapgs, causing the user GS to get used for the rest
of the speculative window.

The mitigation is similar to a traditional Spectre v1 mitigation, except:

  a) index masking isn't possible; because the index (percpu offset)
     isn't user-controlled; and

  b) an lfence is needed in both the "from user" swapgs path and the
     "from kernel" non-swapgs path (because of the two attacks described
     above).

The user entry swapgs paths already have SWITCH_TO_KERNEL_CR3, which has a
CR3 write when PTI is enabled.  Since CR3 writes are serializing, the
lfences can be skipped in those cases.

On the other hand, the kernel entry swapgs paths don't depend on PTI.

To avoid unnecessary lfences for the user entry case, create two separate
features for alternative patching:

  X86_FEATURE_FENCE_SWAPGS_USER
  X86_FEATURE_FENCE_SWAPGS_KERNEL

Use these features in entry code to patch in lfences where needed.

The features aren't enabled yet, so there's no functional change.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
(cherry picked from commit 18ec54fdd6d18d92025af097cd042a75cf0ea24c)
Orabug: 29967571
CVE: CVE-2019-1125

Signed-off-by: Kanth Ghatraju <kanth.ghatraju@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
cpufeatures.h Changes implemented in cpufeature.h.
X86_FEATURE_FENCE_SWAPGS_USER and
X86_FEATURE_FENCE_SWAPGS_KERNEL were assigned
bits in word 11 after some cleaning up of
words 11 and 12 in the patchsets 1674293
and dfe8715. However, this breaks the
kABI for UEK kernels. Use the free bits
in word 2 instead.
calling.h Context
entry_64.S The code is significantly different and the
changes were done manually.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

mlx4_core: change log_num_{qp,rdmarc} with scale_profile

When module parameter 'scale_profile' is set we
use different (than default) parameter values for
certain mlx4_core module parameters which define
some resource limits.

This changes 'log_num_qp' and 'log_num_rdmarc'
to lower values to fix some issues leading to
undesirable HCA resource usage. These led to
undesirable behavior in conjunction with current
round-robin allocation of QPs where some
long-lasting QPs "polluted" ICM memory chunks.

Orabug: 30064080

Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: storvsc: Fix scsi_cmd error assignments in storvsc_handle_error

When an I/O is returned with an srb_status of SRB_STATUS_INVALID_LUN
which has zero good_bytes it must be assigned an error. Otherwise the
I/O will be continuously requeued and will cause a deadlock in the case
where disks are being hot added and removed. sd_probe_async will wait
forever for its I/O to complete while holding scsi_sd_probe_domain.

Also returning the default error of DID_TARGET_FAILURE causes multipath
to not retry the I/O resulting in applications receiving I/O errors
before a failover can occur.

Signed-off-by: Cathy Avery <cavery@redhat.com>
Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d1b8b2391c24751e44f618fcf86fb55d9a9247fd)
Orabug: 30052805
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

USB: check usb_get_extra_descriptor for proper size

Orabug: 29755247
CVE: CVE-2018-20169

When reading an extra descriptor, we need to properly check the minimum
and maximum size allowed, to prevent from invalid data being sent by a
device.

Reported-by: Hui Peng <benquike@gmail.com>
Reported-by: Mathias Payer <mathias.payer@nebelwelt.net>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hui Peng <benquike@gmail.com>
Signed-off-by: Mathias Payer <mathias.payer@nebelwelt.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 704620afc70cf47abb9d6a1a57f3825d2bca49cf)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/usb/core/hub.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

rds: ib: Fix dereference of conn when NULL and cleanup thereof

When rds_ib_cm_connect_complete() and rds_ib_flush_arp_entry() is
called from rds_rdma_cm_event_handler_cmn(), conn may be NULL and a
NULL pointer dereference will happen.

Also cleaned the code by performing the NULL check once.

Orabug: 29924849

Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com>
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Gerd Rausch <gerd.rausch@oracle.com>
(cherry picked from commit b9c03e078f55f3f7cf9d8aba001ca83ab90e1b82)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: zero out the unused memory region in the extent tree block

This commit zeroes out the unused memory region in the buffer_head
corresponding to the extent metablock after writing the extent header
and the corresponding extent node entries.

This is done to prevent random uninitialized data from getting into
the filesystem when the extent block is synced.

This fixes CVE-2019-11833.

Signed-off-by: Sriram Rajagopalan <sriramr@arista.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
(cherry picked from commit 592acbf16821288ecdc4192c47e3774a4c48bb64)

Orabug: 29925523
CVE: CVE-2019-11833

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ip_sockglue: Fix missing-check bug in ip_ra_control()

In function ip_ra_control(), the pointer new_ra is allocated a memory
space via kmalloc(). And it is used in the following codes. However,
when there is a memory allocation error, kmalloc() fails. Thus null
pointer dereference may happen. And it will cause the kernel to crash.
Therefore, we should check the return value and handle the error.

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 425aa0e1d01513437668fa3d4a971168bbaa8515)

Orabug: 29926005
CVE: CVE-2019-12381

Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ipv6_sockglue: Fix a missing-check bug in ip6_ra_control()

In function ip6_ra_control(), the pointer new_ra is allocated a memory
space via kmalloc(). And it is used in the following codes. However,
when there is a memory allocation error, kmalloc() fails. Thus null
pointer dereference may happen. And it will cause the kernel to crash.
Therefore, we should check the return value and handle the error.

Signed-off-by: Gen Zhang <blackgod016574@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 95baa60a0da80a0143e3ddd4d3725758b4513825)

Orabug: 29926057
CVE: CVE-2019-12378

Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/microcode: fix x86_spec_ctrl_mask on late loading.

Fixes: d447e92c7d0a ("x86/microcode: add SPEC_CTRL_SSBD to x86_spec_ctrl_mask on late loading.")
This fixes potential hypervisor panic on guest writes to IA32_SPEC_CTRL after
microcode updates. We made available SSBD if we had a bug in the CPU. The
correct approach is to make available SSBD if we have the flag available.

Orabug: 29941248

Reported-by: Jamie Iles <jamie.iles@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Jamie Iles <jamie.iles@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net: rds: fix rds recv memory leak

In the function rds_ib_inc_free, when rds frags size is changed,
for example, firstly 4K frag is used, then 16K frag is used (for
example during uek2 to uek4 upgrade), this will make
"sg_total_lens(frag->f_sg) != ic->i_frag_sz" true.
In this case, only frag pages are freed while frag is not freed.
This will cause memory leak.

Orabug: 30034815

Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
(cherry picked from commit e877884cbdd7a0ad497b101741ecc66cafa032b3
UEK5-MASTER)

Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: libfc: Fixup disc_mutex handling in fcoe module

The list of attached 'rdata' remote port structures is RCU
protected, so there is no need to take the 'disc_mutex' when
traversing it.
Rather we should be using rcu_read_lock() and kref_get_unless_zero()
to validate the entries.
We need, however, take the disc_mutex when deleting an entry;
otherwise we risk clashes with list_add.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Acked-by: Johannes Thumshirn <jth@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 29511036

(cherry picked from commit a407c593398c886db4fa1fc5c6fec55e61187a09)
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: libfc: sanitize E_D_TOV and R_A_TOV setting in fcp

When setting the FCP timeout we need to ensure a lower boundary
for E_D_TOV and R_A_TOV, otherwise we'd be getting spurious I/O
issues due to the fcp timer firing too early.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Acked-by: Johannes Thumshirn <jth@kernel.org>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 29511036

(cherry picked from commit 76e72ad117812bb79abf647ac40ca6df1740b729)
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

sysctl: Fix kabi breakage

Using __GENKSYMS__ to fix kabi breakage.

Orabug: 29689925

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

proc: Fix proc_sys_prune_dcache to hold a sb reference

Andrei Vagin writes:
FYI: This bug has been reproduced on 4.11.7
> BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo}  still in use (1) [unmount of proc proc]
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
> CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
> Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
> Workqueue: events proc_cleanup_work
> Call Trace:
>  dump_stack+0x63/0x86
>  __warn+0xcb/0xf0
>  warn_slowpath_null+0x1d/0x20
>  umount_check+0x6e/0x80
>  d_walk+0xc6/0x270
>  ? dentry_free+0x80/0x80
>  do_one_tree+0x26/0x40
>  shrink_dcache_for_umount+0x2d/0x90
>  generic_shutdown_super+0x1f/0xf0
>  kill_anon_super+0x12/0x20
>  proc_kill_sb+0x40/0x50
>  deactivate_locked_super+0x43/0x70
>  deactivate_super+0x5a/0x60
>  cleanup_mnt+0x3f/0x90
>  mntput_no_expire+0x13b/0x190
>  kern_unmount+0x3e/0x50
>  pid_ns_release_proc+0x15/0x20
>  proc_cleanup_work+0x15/0x20
>  process_one_work+0x197/0x450
>  worker_thread+0x4e/0x4a0
>  kthread+0x109/0x140
>  ? process_one_work+0x450/0x450
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x2c/0x40
> ---[ end trace e1c109611e5d0b41 ]---
> VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds.  Have a nice day...
> BUG: unable to handle kernel NULL pointer dereference at           (null)
> IP: _raw_spin_lock+0xc/0x30
> PGD 0

Fix this by taking a reference to the super block in proc_sys_prune_dcache.

The superblock reference is the core of the fix however the sysctl_inodes
list is converted to a hlist so that hlist_del_init_rcu may be used.  This
allows proc_sys_prune_dache to remove inodes the sysctl_inodes list, while
not causing problems for proc_sys_evict_inode when if it later choses to
remove the inode from the sysctl_inodes list.  Removing inodes from the
sysctl_inodes list allows proc_sys_prune_dcache to have a progress
guarantee, while still being able to drop all locks.  The fact that
head->unregistering is set in start_unregistering ensures that no more
inodes will be added to the the sysctl_inodes list.

Previously the code did a dance where it delayed calling iput until the
next entry in the list was being considered to ensure the inode remained on
the sysctl_inodes list until the next entry was walked to.  The structure
of the loop in this patch does not need that so is much easier to
understand and maintain.

Cc: stable@vger.kernel.org
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@openvz.org>
Fixes: ace0c791e6c3 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit 2fd1d2c4ceb2248a727696962cf3370dc9f5a0a4)

Orabug: 29689925

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

proc/sysctl: Don't grab i_lock under sysctl_lock.

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> This patch has locking problem. I've got lockdep splat under LTP.
>
> [ 6633.115456] ======================================================
> [ 6633.115502] [ INFO: possible circular locking dependency detected ]
> [ 6633.115553] 4.9.10-debug+ #9 Tainted: G             L
> [ 6633.115584] -------------------------------------------------------
> [ 6633.115627] ksm02/284980 is trying to acquire lock:
> [ 6633.115659]  (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80
> [ 6633.115834] but task is already holding lock:
> [ 6633.115882]  (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110
> [ 6633.116026] which lock already depends on the new lock.
> [ 6633.116026]
> [ 6633.116080]
> [ 6633.116080] the existing dependency chain (in reverse order) is:
> [ 6633.116117]
> -> #2 (sysctl_lock){+.+...}:
> -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}:
> -> #0 (&sb->s_type->i_lock_key#4){+.+...}:
>
> d_lock nests inside i_lock
> sysctl_lock nests inside d_lock in d_compare
>
> This patch adds i_lock nesting inside sysctl_lock.

Al Viro <viro@ZenIV.linux.org.uk> replied:
> Once ->unregistering is set, you can drop sysctl_lock just fine.  So I'd
> try something like this - use rcu_read_lock() in proc_sys_prune_dcache(),
> drop sysctl_lock() before it and regain after.  Make sure that no inodes
> are added to the list ones ->unregistering has been set and use RCU list
> primitives for modifying the inode list, with sysctl_lock still used to
> serialize its modifications.
>
> Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing
> igrab() is safe there.  Since we don't drop inode reference until after we'd
> passed beyond it in the list, list_for_each_entry_rcu() should be fine.

I agree with Al Viro's analsysis of the situtation.

Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering")
Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit ace0c791e6c3cf5ef37cad2df69f0d90ccc40ffb)

Orabug: 29689925

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

proc/sysctl: prune stale dentries during unregistering

Currently unregistering sysctl table does not prune its dentries.
Stale dentries could slowdown sysctl operations significantly.

For example, command:

# for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done
creates a millions of stale denties around sysctls of loopback interface:

# sysctl fs.dentry-state
fs.dentry-state = 25812579  24724135        45      0       0       0

All of them have matching names thus lookup have to scan though whole
hash chain and call d_compare (proc_sys_compare) which checks them
under system-wide spinlock (sysctl_lock).

# time sysctl -a > /dev/null
real    1m12.806s
user    0m0.016s
sys     1m12.400s

Currently only memory reclaimer could remove this garbage.
But without significant memory pressure this never happens.

This patch collects sysctl inodes into list on sysctl table header and
prunes all their dentries once that table unregisters.

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> On 10.02.2017 10:47, Al Viro wrote:
>> how about >> the matching stats *after* that patch?
>
> dcache size doesn't grow endlessly, so stats are fine
>
> # sysctl fs.dentry-state
> fs.dentry-state = 92712 58376 45 0 0 0
>
> # time sysctl -a &>/dev/null
>
> real 0m0.013s
> user 0m0.004s
> sys 0m0.008s

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
(cherry picked from commit d6cffbbe9a7e51eb705182965a189457c17ba8a3)

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/proc/proc_sysctl.c Context has changed

Orabug: 29689925

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: smartpqi: correct lun reset issues

[ Upstream commit 2ba55c9851d74eb015a554ef69ddf2ef061d5780 ]

Problem:
The Linux kernel takes a logical volume offline after a LUN reset. This is
generally accompanied by this message in the dmesg output:

Device offlined - not ready after error recovery

Root Cause:
The root cause is a "quirk" in the timeout handling in the Linux SCSI
layer. The Linux kernel places a 30-second timeout on most media access
commands (reads and writes) that it send to device drivers. When a media
access command times out, the Linux kernel goes into error recovery mode
for the LUN that was the target of the command that timed out. Every
command that timed out is kept on a list inside of the Linux kernel to be
retried later. The kernel attempts to recover the command(s) that timed out
by issuing a LUN reset followed by a TEST UNIT READY. If the LUN reset and
TEST UNIT READY commands are successful, the kernel retries the command(s)
that timed out.

Each SCSI command issued by the kernel has a result field associated with
it. This field indicates the final result of the command (success or
error). When a command times out, the kernel places a value in this result
field indicating that the command timed out.

The "quirk" is that after the LUN reset and TEST UNIT READY commands are
completed, the kernel checks each command on the timed-out command list
before retrying it. If the result field is still "timed out", the kernel
treats that command as not having been successfully recovered for a
retry. If the number of commands that are in this state are greater than
two, the kernel takes the LUN offline.

Fix:
When our RAIDStack receives a LUN reset, it simply waits until all
outstanding commands complete. Generally, all of these outstanding commands
complete successfully. Therefore, the fix in the smartpqi driver is to
always set the command result field to indicate success when a request
completes successfully. This normally isn’t necessary because the result
field is always initialized to success when the command is submitted to the
driver. So when the command completes successfully, the result field is
left untouched. But in this case, the kernel changes the result field
behind the driver’s back and then expects the field to be changed by the
driver as the commands that timed-out complete.

Reviewed-by: Dave Carroll <david.carroll@microsemi.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com>
Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com>
Signed-off-by: Don Brace <don.brace@microsemi.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Orabug: 29848621

(cherry picked from commit ceecd0c696456068c2a0e1f0d3c32e5290d115ab)
Signed-off-by: Rajan Shanmugavelu <rajan.shanmugavelu@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

fork: record start_time late

commit 7b55851367136b1efd84d98fea81ba57a98304cf upstream.

This changes the fork(2) syscall to record the process start_time after
initializing the basic task structure but still before making the new
process visible to user-space.

Technically, we could record the start_time anytime during fork(2).  But
this might lead to scenarios where a start_time is recorded long before
a process becomes visible to user-space.  For instance, with
userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
for an indefinite amount of time (and will, if this causes network
access, or similar).

By recording the start_time late, it much closer reflects the point in
time where the process becomes live and can be observed by other
processes.

Lastly, this makes it much harder for user-space to predict and control
the start_time they get assigned.  Previously, user-space could fork a
process and stall it in copy_thread_tls() before its pid is allocated,
but after its start_time is recorded.  This can be misused to later-on
cycle through PIDs and resume the stalled fork(2) yielding a process
that has the same pid and start_time as a process that existed before.
This can be used to circumvent security systems that identify processes
by their pid+start_time combination.

Even though user-space was always aware that start_time recording is
flaky (but several projects are known to still rely on start_time-based
identification), changing the start_time to be recorded late will help
mitigate existing attacks and make it much harder for user-space to
control the start_time a process gets assigned.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3f2e4e1d9a6cffa95d31b7a491243d5e92a82507)

Orabug: 29850581
CVE: CVE-2019-6133

Signed-off-by: John Donnelly <john.p.donnelly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm: avoid taking zone lock in pagetypeinfo_showmixed()

pagetypeinfo_showmixedcount_print is found to take a lot of time to
complete and it does this holding the zone lock and disabling
interrupts. In some cases it is found to take more than a second (On a
2.4GHz,8Gb RAM,arm64 cpu).

Avoid taking the zone lock similar to what is done by read_page_owner,
which means possibility of inaccurate results.

Link: http://lkml.kernel.org/r/1498045643-12257-1-git-send-email-vinmenon@codeaurora.org
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: zhongjiang <zhongjiang@huawei.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 29905302
(cherry picked from commit 727c080f03e7e2e20e868efd461d4f1022b61d9b)
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Tong Chen <tong.c.chen@oracle.com>
Conflicts:
mm/page_owner.c
mm/vmstat.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/retpoline/ia32entry: Convert to non-speculative calls

Convert indirect jumps in 32-bit compat entry assembler code to use
non-speculative sequences when CONFIG_RETPOLINE is enabled.

The ia32entry code does not care about the length of the CALL_NOSPEC
fragment, so unlike similar indirect callsites in entry_64.S we use
CALL_NOSPEC everywhere.

Orabug: 29909295
CVE: CVE-2017-5715

Based on entry_64.S changes in upstream commit
2641f08bb7fc63a636a2b18173221d7040a3512e.

Reported-by: Jamie Iles <jamie.iles@oracle.com>
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tun: call dev_get_valid_name() before register_netdevice()

register_netdevice() could fail early when we have an invalid
dev name, in which case ->ndo_uninit() is not called. For tun
device, this is a problem because a timer etc. are already
initialized and it expects ->ndo_uninit() to clean them up.

We could move these initializations into a ->ndo_init() so
that register_netdevice() knows better, however this is still
complicated due to the logic in tun_detach().

Therefore, I choose to just call dev_get_valid_name() before
register_netdevice(), which is quicker and much easier to audit.
And for this specific case, it is already enough.

Fixes: 96442e42429e ("tuntap: choose the txq based on rxq")
Reported-by: Dmitry Alexeev <avekceeb@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0ad646c81b2182f7fa67ec0c8c825e0ee165696d)

Orabug: 29925555
CVE: CVE-2018-7191

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

mm/madvise.c: fix madvise() infinite loop under special circumstances

MADVISE_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
Unfortunately madvise_willneed() doesn't communicate this information
properly to the generic madvise syscall implementation.  The calling
convention is quite subtle there.  madvise_vma() is supposed to either
return an error or update &prev otherwise the main loop will never
advance to the next vma and it will keep looping for ever without a way
to get out of the kernel.

It seems this has been broken since introduction.  Nobody has noticed
because nobody seems to be using MADVISE_WILLNEED on these DAX mappings.

[mhocko@suse.com: rewrite changelog]
Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place")
Signed-off-by: chenjie <chenjie6@huawei.com>
Signed-off-by: guoxuenan <guoxuenan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: zhangyi (F) <yi.zhang@huawei.com>
Cc: Miao Xie <miaoxie@huawei.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6ea8d958a2c95a1d514015d4e29ba21a8c0a1a91)

Orabug: 29925610
CVE: CVE-2017-18208

Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vxlan: fix use-after-free on deletion (part 2)

After commit 15a48cc22bf03434 (vxlan: fix use-after-free on deletion)
vxlan_dellink doesn't need to take out the vxlan_dev from the hlist.
Because, that is done in vxlan_vs_del_dev now.
Having it at both the places results in system crash sometimes with
the following stack trace

general protection fault: 0000 [#1] SMP
...
vxlan_sock_release+0x6a/0xd0 [vxlan]
Call Trace:
vxlan_stop+0xee/0x310 [vxlan]
__dev_close_many+0xa8/0x110
dev_close_many+0x98/0x140
rollback_registered_many+0x10d/0x310
unregister_netdevice_many+0x19/0x60
default_device_exit_batch+0x164/0x190
? abort_exclusive_wait+0xb0/0xb0
ops_exit_list.isra.4+0x59/0x70
cleanup_net+0x1f0/0x380
process_one_work+0x165/0x470
worker_thread+0x112/0x540
? rescuer_thread+0x3f0/0x3f0
kthread+0xda/0xf0

It crashes here:

static void vxlan_vs_del_dev(struct vxlan_dev *vxlan)
{
struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);

spin_lock(&vn->sock_lock);
hlist_del_init_rcu(&vxlan->hlist); <--- here
spin_unlock(&vn->sock_lock);
}

Orabug: 29927196

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vxlan: use a more suitable function when assigning NULL

When stopping the vxlan interface we detach it from the socket.
Use RCU_INIT_POINTER() and not rcu_assign_pointer() to do so.

Orabug: 29927196

Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 57d88182ea3e8763111882671fd7462289272f64)
Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vxlan: avoid using stale vxlan socket.

When vxlan device is closed vxlan socket is freed. This
operation can race with vxlan-xmit function which
dereferences vxlan socket. Following patch uses RCU
mechanism to avoid this situation.

Orabug: 29927196

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c6fcc4fc5f8b592600c7409e769ab68da0fb1eca)
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/microcode: add SPEC_CTRL_SSBD to x86_spec_ctrl_mask on late loading.

This is required so that we don't filter out the SPEC_CTRL_SSBD bit from
what the guest is trying to write to the MSR_IA32_SPEC_CTRL in
x86_virt_spec_ctrl(). Failure to do would make it look like the guest
correctly enabled SSBD when it did not, as reading back the MSR from the
guest would not show the bit was filtered out, giving a false sense of
security.

Orabug: 29642139

Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

block: do not use interruptible wait anywhere

When blk_queue_enter() waits for a queue to unfreeze, or unset the
PREEMPT_ONLY flag, do not allow it to be interrupted by a signal.

The PREEMPT_ONLY flag was introduced later in commit 3a0a529971ec
("block, scsi: Make SCSI quiesce and resume work reliably").  Note the SCSI
device is resumed asynchronously, i.e. after un-freezing userspace tasks.

So that commit exposed the bug as a regression in v4.15.  A mysterious
SIGBUS (or -EIO) sometimes happened during the time the device was being
resumed.  Most frequently, there was no kernel log message, and we saw Xorg
or Xwayland killed by SIGBUS.[1]

[1] E.g. https://bugzilla.redhat.com/show_bug.cgi?id=1553979

Without this fix, I get an IO error in this test:

  while killall -SIGUSR1 dd; do sleep 0.1; done & \
  echo mem > /sys/power/state ; \
  sleep 5; killall dd  # stop after 5 seconds

The interruptible wait was added to blk_queue_enter in
commit 3ef28e83ab15 ("block: generic request_queue reference counting").
Before then, the interruptible wait was only in blk-mq, but I don't think
it could ever have been correct.

Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
(cherry picked from commit 1dc3039bc87ae7d19a990c3ee71cfd8a9068f428)

Orabug: 29674055

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Conflicts:
block/blk-core.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

vxlan: fix use-after-free on deletion

Adding a vxlan interface to a socket isn't symmetrical, while adding
is done in vxlan_open() the deletion is done in vxlan_dellink().
This can cause a use-after-free error when we close the vxlan
interface before deleting it.

We add vxlan_vs_del_dev() to match vxlan_vs_add_dev() and call
it from vxlan_stop() to match the call from vxlan_open().

Orabug: 29755932

Fixes: 56ef9c909b40 ("vxlan: Move socket initialization to within rtnl scope")
Acked-by: Jiri Benc <jbenc@redhat.com>
Tested-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Mark Bloch <markb@mellanox.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a53cb29b0af346af44e4abf13d7e59f807fba690)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

vxlan: reduce usage of synchronize_net in ndo_stop

We only need to do the synchronize_net dance once for both, ipv4 and
ipv6 sockets, thus removing one synchronize_net in case both sockets get
dismantled.

Orabug: 29755932

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 544a773a01828e3cc3b553721f68d880d0d27a97)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vxlan: synchronously and race-free destruction of vxlan sockets

Due to the fact that the udp socket is destructed asynchronously in a
work queue, we have some nondeterministic behavior during shutdown of
vxlan tunnels and creating new ones. Fix this by keeping the destruction
process synchronous in regards to the user space process so IFF_UP can
be reliably set.

udp_tunnel_sock_release destroys vs->sock->sk if reference counter
indicates so. We expect to have the same lifetime of vxlan_sock and
vxlan_sock->sock->sk even in fast paths with only rcu locks held. So
only destruct the whole socket after we can be sure it cannot be found
by searching vxlan_net->sock_list.

Orabug: 29755932

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Benc <jbenc@redhat.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0412bd931f5f94d1054e958415c4a945d8ee62f4)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vxlan: support both IPv4 and IPv6 sockets in a single vxlan device

For metadata based vxlan interface, open both IPv4 and IPv6 socket. This is
much more user friendly: it's not necessary to create two vxlan interfaces
and pay attention to using the right one in routing rules.

Orabug: 29755932

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b1be00a6c39fda2ec380e168d7bcf96fb8c9da42)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

vxlan: make vxlan_sock_add and vxlan_sock_release complementary

Make vxlan_sock_add both alloc the socket and attach it to vxlan_dev. Let
vxlan_sock_release accept vxlan_dev as its parameter instead of vxlan_sock.

This makes vxlan_sock_add and vxlan_sock release complementary. It reduces
code duplication in the next patch.

Orabug: 29755932

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 205f356d165033443793a97a668a203a79a8723a)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

openvswitch: Re-add CONFIG_OPENVSWITCH_VXLAN

This readds the config option CONFIG_OPENVSWITCH_VXLAN to avoid a
hard dependency of OVS on VXLAN. It moves the VXLAN config compat
code to vport-vxlan.c and allows compliation as a module.

Orabug: 29755932

Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device")
Fixes: 2661371ace96 ("openvswitch: fix compilation when vxlan is a module")
Cc: Pravin B Shelar <pshelar@nicira.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit dcc38c033b32b81b88b798f0c0b8453839ac996b)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/openvswitch/vport-netdev.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

openvswitch: Use regular VXLAN net_device device

This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.

Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.

Orabug: 29755932

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 614732eaa12dd462c0ab274700bed14f36afea5e)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

openvswitch: Abstract vport name through ovs_vport_name()

This allows to get rid of the get_name() vport ops later on.

Orabug: 29755932

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c9db965c524ea27451e60d5ddcd242f6c33a70fd)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

openvswitch: Move dev pointer into vport itself

This is the first step in representing all OVS vports as regular
struct net_devices. Move the net_device pointer into the vport
structure itself to get rid of struct vport_netdev.

Orabug: 29755932

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit be4ace6e6b1bc12e18b25fe764917e09a1f96d7b)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ip_tunnel: Make ovs_tunnel_info and ovs_key_ipv4_tunnel generic

Rename the tunnel metadata data structures currently internal to
OVS and make them generic for use by all IP tunnels.

Both structures are kernel internal and will stay that way. Their
members are exposed to user space through individual Netlink
attributes by OVS. It will therefore be possible to extend/modify
these structures without affecting user ABI.

Orabug: 29755932

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1d8fff907342d2339796dbd27ea47d0e76a6a2d0)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

vxlan: Factor out device configuration

This factors out the device configuration out of the RTNL newlink
API which allows for in-kernel creation of VXLAN net_devices.

Orabug: 29755932

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0dfbdf4102b9303d3ddf2177c0220098ff99f6de)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/vxlan.c

Signed-off-by: Brian Maly <brian.maly@oracle.com>

kexec: generate VMCOREINFO for module symbols

This commit is one of the three commit for generating VMCOREINFO
symbol information.  See comments in previous two commits.

To dump kallsyms, module symbols must also be included.  The
symbol table of each module can be located through the module
list.  This commit generates necessary symbol information for
accessing the symbol tables of loaded modules.

Orabug: 29770217
Signed-off-by: Isaac Chen <isaac.chen@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

kexec: generate VMCOREINFO for tasks and pid

This commit is the continuation of the previous commit.
For more information see the previous commit comments.

This commit changes the VMCOREINFO buffer size from 4k to 8k, and
generates more symbol information for tasks and pid.

Orabug: 29770217
Signed-off-by: Isaac Chen <isaac.chen@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

kexec: generate VMCOREINFO for trace dump

The goal for this commit (and the next two) is to enable tools
to dump the trace buffer from /proc/vmcore at kdump time, without
requiring the kernel debug info package being installed.

makedumpfile processes VMCOREINFO section to dump dmesg buffer.
Extending VMCOREINFO to facilitate dumping trace buffer can be
very helpful in diagnosing system crash.

This is part one of three commits to generate symbol information
into VMCOREINFO section.

Orabug: 29770217
Signed-off-by: Isaac Chen <isaac.chen@oracle.com>
Reviewed-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Add CVE numbers for CVE-2019-11477 CVE-2019-11478 CVE-2019-11479

CVE-2019-11477
- tcp: limit payload size of sacked skbs
- tcp: fix fack_count accounting on tcp_shift_skb_data()
- tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()

CVE-2019-11478
- tcp: tcp_fragment() should apply sane memory limits

CVE-2019-11479
- tcp: add tcp_min_snd_mss sysctl

Orabug: 29890820, 29886598, 29884306
CVE: CVE-2019-11477, CVE-2019-11478, CVE-2019-11479

Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>

tcp: fix fack_count accounting on tcp_shift_skb_data()

v4.15 or since commit 737ff314563 ("tcp: use sequence distance to
detect reordering") had switched from the packet-based FACK tracking
to sequence-based.

v4.14 and older still have the old logic and hence on
tcp_skb_shift_data() needs to retain its original logic and have
@fack_count in sync. In other words, we keep the increment of pcount with
tcp_skb_pcount(skb) to later used that to update fack_count. To make it
more explicit we track the new skb that gets incremented to pcount in
@next_pcount, and we get to avoid the constant invocation of
tcp_skb_pcount(skb) all together.

Orabug: 29890820
Fixes: 1b56e4cb5dec ("tcp: limit payload size of sacked skbs")
Reported-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Rao Shoaib <rao.shoaib@oracle.com>>>>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()

If mtu probing is enabled tcp_mtu_probing() could very well end up
with a too small MSS.

Use the new sysctl tcp_min_snd_mss to make sure MSS search
is performed in an acceptable range.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Jonathan Looney <jtl@netflix.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Orabug: 29886598
Reviewed-by: John Haxby <john.haxby@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: add tcp_min_snd_mss sysctl

Some TCP peers announce a very small MSS option in their SYN and/or
SYN/ACK messages.

This forces the stack to send packets with a very high network/cpu
overhead.

Linux has enforced a minimal value of 48. Since this value includes
the size of TCP options, and that the options can consume up to 40
bytes, this means that each segment can include only 8 bytes of payload.

In some cases, it can be useful to increase the minimal value
to a saner value.

We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility
reasons.

Note that TCP_MAXSEG socket option enforces a minimal value
of (TCP_MIN_MSS). David Miller increased this minimal value
in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.")
from 64 to 88.

We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Jonathan Looney <jtl@netflix.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Orabug: 29884306
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[For backport we had to wrap sysctl_tcp_min_snd_mss from struct netns_ipv4
in the __GENKSYMS__ gunk. There is a nice 4 byte hole in it, so we fit it
within that structure. The kernel is the one responsible for creating the
structure so no danger of third-party drivers handing us a shorter structure.

Pahole is OK too:

struct linux_xfrm_mib {
long unsigned int          mibs[29];             /*     0   232 */
@@ -9096,16 +9096,14 @@
/* --- cacheline 6 boundary (384 bytes) --- */
struct ping_group_range    ping_group_range;     /*   384    16 */
atomic_t                   dev_addr_genid;       /*   400     4 */
-
- /* XXX 4 bytes hole, try to pack */
-
+ int                        sysctl_tcp_min_snd_mss; /*   404     4 */
long unsigned int *        sysctl_local_reserved_ports; /*   408     8 */
struct list_head           mr_tables;            /*   416    16 */
struct fib_rules_ops *     mr_rules_ops;         /*   432     8 */
atomic_t                   rt_genid;             /*   440     4 */

- /* size: 448, cachelines: 7, members: 50 */
- /* sum members: 390, holes: 5, sum holes: 54 */
+ /* size: 448, cachelines: 7, members: 51 */
+ /* sum members: 394, holes: 4, sum holes: 50 */
/* padding: 4 */
/* paddings: 1, sum paddings: 12 */
};
]

tcp: tcp_fragment() should apply sane memory limits

Jonathan Looney reported that a malicious peer can force a sender
to fragment its retransmit queue into tiny skbs, inflating memory
usage and/or overflow 32bit counters.

TCP allows an application to queue up to sk_sndbuf bytes,
so we need to give some allowance for non malicious splitting
of retransmit queue.

A new SNMP counter is added to monitor how many times TCP
did not allow to split an skb if the allowance was exceeded.

Note that this counter might increase in the case applications
use SO_SNDBUF socket option to lower sk_sndbuf.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Orabug: 29884306
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

tcp: limit payload size of sacked skbs

Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :

BUG_ON(tcp_skb_pcount(skb) < pcount);

This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48

An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.

This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.

Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.

Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
tcp_collapse_retrans() is quite different in UEK4 and needs special
review. No change is needed in tcp_collapse_retrans() because
in UEK4 it is called only after checking with skb_availroom() for
available room and that the skb is linear. That is not the case in later
releases.

Arguments to tcp_shifted_skb() are different compared to the original patch,
but the difference is inconsequential to issue being addressed.

In UEK4, TCP_SKB_CB(skb)->tcp_gso_segs is 32 bits.
So the original 16-bit overflow issue does not exist.
However, it is prudent to limit UEK4 as well.

Orabug: 29884306
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

hugetlbfs: don't retry when pool page allocations start to fail

When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
the pages will be interleaved between all nodes of the system.  If
nodes are not equal, it is quite possible for one node to fill up
before the others.  When this happens, the code still attempts to
allocate pages from the full node.  This results in calls to direct
reclaim and compaction which slow things down considerably.

When allocating pool pages, note the state of the previous allocation
for each node.  If previous allocation failed, do not use the
aggressive retry algorithm on successive attempts.  The allocation
will still succeed if there is memory available, but it will not try
as hard to free up memory.

In routine hugetlb_hstate_alloc_pages, do not allocate the bitmap for
gigantic pages as this routine is called before runtime memory
allocators are available.  Also, note that the bitmap is not necessary
in the case of boot time allocation of gigantic pages.

Orabug: 29324267

Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: John Donnelly <john.p.donnelly@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: RSB stuffing with retpoline on Skylake+ cpus

Skylake processors have different RSB behavior than other processors
when the RSB is empty, and need to use half RSB stuffing to avoid RSB
underflow.

The STUFF_RSB macro is used in conjunction with 2 dynamic keys to
perform RSB stuffing of half or full RSB entries.

Signed-off-by: William Roche <william.roche@oracle.com>
Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
(cherry picked from commit 5845d83cfa6fd8a2afe1e01b4bda27d21e83f7a7)

Orabug: 29660924

Signed-off-by: William Roche <william.roche@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
bugs.c vs bugs_64.c in UEK4
- the "__init" attribute on the is_skylake_era() function doesn't need
to be removed -- it's not there in UEK4.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: reformatting RSB overwrite macro

Introducing the __ASM_RSB_STUFF code directly into the macro to prepare
the possibility to do both half RSB stuffing and RSB overwriting with it.

Signed-off-by: William Roche <william.roche@oracle.com>
Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
(cherry picked from commit fe603bb0b80036523fb34b6c035d1ee66a26262f)

Orabug: 29660924

Signed-off-by: William Roche <william.roche@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/spec_ctrl.h

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: Dynamic enable and disable of RSB stuffing with IBRS&!SMEP

As IBRS is a dynamic feature, RSB overwrite needs to also be dynamically
activated and disabled with IBRS (if SMEP is not available).

Signed-off-by: William Roche <william.roche@oracle.com>
Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
(cherry picked from commit fe61cd8181780903b2e092e2079ee09a6eb733d9)

Orabug: 29660924

Signed-off-by: William Roche <william.roche@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/kernel/cpu/bugs.c
arch/x86/kernel/cpu/spec_ctrl.c
bugs.c vs bugs_64.c in UEK4
spec_ctrl.c code still in bugs_64.c on UEK4

Signed-off-by: Brian Maly <brian.maly@oracle.com>

x86/speculation: STUFF_RSB dynamic enable

The STUFF_RSB overwrite macro can be enabled dynamically with
rsb_overwrite_key instead of using X86_FEATURE_STUFF_RSB.

Signed-off-by: William Roche <william.roche@oracle.com>
Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
(cherry picked from commit 84e09871beb92364bd374d8c3bc3441a8c4be593)

Orabug: 29660924

Signed-off-by: William Roche <william.roche@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/cpufeatures.h
arch/x86/include/asm/spec_ctrl.h
arch/x86/kernel/cpu/bugs.c
cpufeatures.h vs cpufeature.h in UEK4
include <linux/jump_label.h> header in spec_ctrl.h to use this feature
bugs.c vs bugs_64.c in UEK4

Signed-off-by: Brian Maly <brian.maly@oracle.com>

int3 handler better address space detection on interrupts

In order to prepare the possibility to dynamically change an
interrupt handler code with static_branch_enable/disable,
as the interrupt can equally appear while in user space or
kernel space, the int3 handler itself must better identify if
the original interrupt is from kernel or userland.

Signed-off-by: William Roche <william.roche@oracle.com>
Co-developed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
(cherry picked from commit 594fc07cd96784004254680c9e1e4b757fb0a1f5)

Orabug: 29660924

Signed-off-by: William Roche <william.roche@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Acked-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/entry/entry_64.S
entry/entry_64.S vs kernel/entry_64.S in UEK4

Signed-off-by: Brian Maly <brian.maly@oracle.com>

repairing out-of-tree build functionality

The current uek4 tree (only) cannot build the binrpm-pkg with an objdir
outside the source tree. This fix redirects the, incorrectly, placed
generated firmware files and firmware parsers into the objtree.

Orabug: 29755100

Signed-off-by: Mark Nicholson <mark.j.nicholson@oracle.com>
Reviewed-by: Tianyue Lan <tianyue.lan@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ext4: fix false negatives *and* false positives in ext4_check_descriptors()

Ext4_check_descriptors() was getting called before s_gdb_count was
initialized.  So for file systems w/o the meta_bg feature, allocation
bitmaps could overlap the block group descriptors and ext4 wouldn't
notice.

For file systems with the meta_bg feature enabled, there was a
fencepost error which would cause the ext4_check_descriptors() to
incorrectly believe that the block allocation bitmap overlaps with the
block group descriptor blocks, and it would reject the mount.

Fix both of these problems.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
(cherry picked from commit 44de022c4382541cebdd6de4465d1f4f465ff1dd)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
    fs/ext4/super.c
    [The contextual has been changed]

Orabug: 29797007

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

ocfs2: fix ocfs2 read inode data panic in ocfs2_iget

In some cases, ocfs2_iget() reads the data of inode, which has been
deleted for some reason.  That will make the system panic.  So We should
judge whether this inode has been deleted, and tell the caller that the
inode is a bad inode.

For example, the ocfs2 is used as the backed of nfs, and the client is
nfsv3.  This issue can be reproduced by the following steps.

on the nfs server side,
..../patha/pathb

Step 1: The process A was scheduled before calling the function fh_verify.

Step 2: The process B is removing the 'pathb', and just completed the call
to function dput.  Then the dentry of 'pathb' has been deleted from the
dcache, and all ancestors have been deleted also.  The relationship of
dentry and inode was deleted through the function hlist_del_init.  The
following is the call stack.
dentry_iput->hlist_del_init(&dentry->d_u.d_alias)

At this time, the inode is still in the dcache.

Step 3: The process A call the function ocfs2_get_dentry, which get the
inode from dcache.  Then the refcount of inode is 1.  The following is the
call stack.
nfsd3_proc_getacl->fh_verify->exportfs_decode_fh->fh_to_dentry(ocfs2_get_dentry)

Step 4: Dirty pages are flushed by bdi threads.  So the inode of 'patha'
is evicted, and this directory was deleted.  But the inode of 'pathb'
can't be evicted, because the refcount of the inode was 1.

Step 5: The process A keep running, and call the function
reconnect_path(in exportfs_decode_fh), which call function
ocfs2_get_parent of ocfs2.  Get the block number of parent
directory(patha) by the name of ...  Then read the data from disk by the
block number.  But this inode has been deleted, so the system panic.

Process A                                             Process B
1. in nfsd3_proc_getacl                   |
2.                                        |        dput
3. fh_to_dentry(ocfs2_get_dentry)         |
4. bdi flush dirty cache                  |
5. ocfs2_iget                             |

[283465.542049] OCFS2: ERROR (device sdp): ocfs2_validate_inode_block:
Invalid dinode #580640: OCFS2_VALID_FL not set

[283465.545490] Kernel panic - not syncing: OCFS2: (device sdp): panic forced
after error

[283465.546889] CPU: 5 PID: 12416 Comm: nfsd Tainted: G        W
4.1.12-124.18.6.el6uek.bug28762940v3.x86_64 #2
[283465.548382] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
Desktop Reference Platform, BIOS 6.00 09/21/2015
[283465.549657]  0000000000000000 ffff8800a56fb7b8 ffffffff816e839c
ffffffffa0514758
[283465.550392]  000000000008dc20 ffff8800a56fb838 ffffffff816e62d3
0000000000000008
[283465.551056]  ffff880000000010 ffff8800a56fb848 ffff8800a56fb7e8
ffff88005df9f000
[283465.551710] Call Trace:
[283465.552516]  [<ffffffff816e839c>] dump_stack+0x63/0x81
[283465.553291]  [<ffffffff816e62d3>] panic+0xcb/0x21b
[283465.554037]  [<ffffffffa04e66b0>] ocfs2_handle_error+0xf0/0xf0 [ocfs2]
[283465.554882]  [<ffffffffa04e7737>] __ocfs2_error+0x67/0x70 [ocfs2]
[283465.555768]  [<ffffffffa049c0f9>] ocfs2_validate_inode_block+0x229/0x230
[ocfs2]
[283465.556683]  [<ffffffffa047bcbc>] ocfs2_read_blocks+0x46c/0x7b0 [ocfs2]
[283465.557408]  [<ffffffffa049bed0>] ? ocfs2_inode_cache_io_unlock+0x20/0x20
[ocfs2]
[283465.557973]  [<ffffffffa049f0eb>] ocfs2_read_inode_block_full+0x3b/0x60
[ocfs2]
[283465.558525]  [<ffffffffa049f5ba>] ocfs2_iget+0x4aa/0x880 [ocfs2]
[283465.559082]  [<ffffffffa049146e>] ocfs2_get_parent+0x9e/0x220 [ocfs2]
[283465.559622]  [<ffffffff81297c05>] reconnect_path+0xb5/0x300
[283465.560156]  [<ffffffff81297f46>] exportfs_decode_fh+0xf6/0x2b0
[283465.560708]  [<ffffffffa062faf0>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd]
[283465.561262]  [<ffffffff810a8196>] ? prepare_creds+0x26/0x110
[283465.561932]  [<ffffffffa0630860>] fh_verify+0x350/0x660 [nfsd]
[283465.562862]  [<ffffffffa0637804>] ? nfsd_cache_lookup+0x44/0x630 [nfsd]
[283465.563697]  [<ffffffffa063a8b9>] nfsd3_proc_getattr+0x69/0xf0 [nfsd]
[283465.564510]  [<ffffffffa062cf60>] nfsd_dispatch+0xe0/0x290 [nfsd]
[283465.565358]  [<ffffffffa05eb892>] ? svc_tcp_adjust_wspace+0x12/0x30
[sunrpc]
[283465.566272]  [<ffffffffa05ea652>] svc_process_common+0x412/0x6a0 [sunrpc]
[283465.567155]  [<ffffffffa05eaa03>] svc_process+0x123/0x210 [sunrpc]
[283465.568020]  [<ffffffffa062c90f>] nfsd+0xff/0x170 [nfsd]
[283465.568962]  [<ffffffffa062c810>] ? nfsd_destroy+0x80/0x80 [nfsd]
[283465.570112]  [<ffffffff810a622b>] kthread+0xcb/0xf0
[283465.571099]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180
[283465.572114]  [<ffffffff816f11b8>] ret_from_fork+0x58/0x90
[283465.573156]  [<ffffffff810a6160>] ? kthread_create_on_node+0x180/0x180

Link: http://lkml.kernel.org/r/1554185919-3010-1-git-send-email-sunny.s.zhang@oracle.com
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: piaojun <piaojun@huawei.com>
Cc: "Gang He" <ghe@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e091eab028f9253eac5c04f9141bbc9d170acab3)

Orabug: 29233739

Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: John Donnelly <John.p.donnelly@Oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Bluetooth: Verify that l2cap_get_conf_opt provides large enough buffer

The function l2cap_get_conf_opt will return L2CAP_CONF_OPT_SIZE + opt->len
as length value. The opt->len however is in control over the remote user
and can be used by an attacker to gain access beyond the bounds of the
actual packet.

To prevent any potential leak of heap memory, it is enough to check that
the resulting len calculation after calling l2cap_get_conf_opt is not
below zero. A well formed packet will always return >= 0 here and will
end with the length value being zero after the last option has been
parsed. In case of malformed packets messing with the opt->len field the
length value will become negative. If that is the case, then just abort
and ignore the option.

In case an attacker uses a too short opt->len value, then garbage will
be parsed, but that is protected by the unknown option handling and also
the option parameter size checks.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Orabug: 29526426
CVE: CVE-2019-3459
(cherry picked from commit 7c9cbd0b5e38a1672fcd137894ace3b042dfbf69)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt

When doing option parsing for standard type values of 1, 2 or 4 octets,
the value is converted directly into a variable instead of a pointer. To
avoid being tricked into being a pointer, check that for these option
types that sizes actually match. In L2CAP every option is fixed size and
thus it is prudent anyway to ensure that the remote side sends us the
right option size along with option paramters.

If the option size is not matching the option type, then that option is
silently ignored. It is a protocol violation and instead of trying to
give the remote attacker any further hints just pretend that option is
not present and proceed with the default values. Implementation
following the specification and its qualification procedures will always
use the correct size and thus not being impacted here.

To keep the code readable and consistent accross all options, a few
cosmetic changes were also required.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Orabug: 29526426
CVE: CVE-2019-3459
(cherry picked from commit af3d5d1c87664a4f150fcf3534c6567cb19909b0)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

HID: debug: fix the ring buffer implementation

commit 13054abbaa4f1fd4e6f3b4b63439ec033b4c8035 upstream.

Ring buffer implementation in hid_debug_event() and hid_debug_events_read()
is strange allowing lost or corrupted data. After commit 717adfdaf147
("HID: debug: check length before copy_to_user()") it is possible to enter
an infinite loop in hid_debug_events_read() by providing 0 as count, this
locks up a system. Fix this by rewriting the ring buffer implementation
with kfifo and simplify the code.

This fixes CVE-2019-3819.

v2: fix an execution logic and add a comment
v3: use __set_current_state() instead of set_current_state()

Backport to v4.4: some (tree-wide) patches are missing in v4.4 so
cherry-pick relevant pieces from:
* 6396bb22151 ("treewide: kzalloc() -> kcalloc()")
* a9a08845e9ac ("vfs: do bulk POLL* -> EPOLL* replacement")
* 92529623d242 ("HID: debug: improve hid_debug_event()")
* 174cd4b1e5fb ("sched/headers: Prepare to move signal wakeup & sigpending
methods from <linux/sched.h> into <linux/sched/signal.h>")

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187
Cc: stable@vger.kernel.org # v4.18+
Fixes: cd667ce24796 ("HID: use debugfs for events/reports dumping")
Fixes: 717adfdaf147 ("HID: debug: check length before copy_to_user()")
Signed-off-by: Vladis Dronov <vdronov@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b661fff5f8a0f19824df91cc3905ba2c5b54dc87)

Orabug: 29629481
CVE: CVE-2019-3819

Reviewed-by: John Donnelly <John.p.donnelly@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: target: iscsi: Use hex2bin instead of a re-implementation

commit 1816494330a83f2a064499d8ed2797045641f92c upstream.

This change has the following effects, in order of descreasing importance:

1) Prevent a stack buffer overflow

2) Do not append an unnecessary NULL to an anyway binary buffer, which
   is writing one byte past client_digest when caller is:
   chap_string_to_hex(client_digest, chap_r, strlen(chap_r));

The latter was found by KASAN (see below) when input value hes expected size
(32 hex chars), and further analysis revealed a stack buffer overflow can
happen when network-received value is longer, allowing an unauthenticated
remote attacker to smash up to 17 bytes after destination buffer (16 bytes
attacker-controlled and one null).  As switching to hex2bin requires
specifying destination buffer length, and does not internally append any null,
it solves both issues.

This addresses CVE-2018-14633.

Beyond this:

- Validate received value length and check hex2bin accepted the input, to log
  this rejection reason instead of just failing authentication.

- Only log received CHAP_R and CHAP_C values once they passed sanity checks.

==================================================================
BUG: KASAN: stack-out-of-bounds in chap_string_to_hex+0x32/0x60 [iscsi_target_mod]
Write of size 1 at addr ffff8801090ef7c8 by task kworker/0:0/1021

CPU: 0 PID: 1021 Comm: kworker/0:0 Tainted: G           O      4.17.8kasan.sess.connops+ #2
Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 05/19/2014
Workqueue: events iscsi_target_do_login_rx [iscsi_target_mod]
Call Trace:
dump_stack+0x71/0xac
print_address_description+0x65/0x22e
? chap_string_to_hex+0x32/0x60 [iscsi_target_mod]
kasan_report.cold.6+0x241/0x2fd
chap_string_to_hex+0x32/0x60 [iscsi_target_mod]
chap_server_compute_md5.isra.2+0x2cb/0x860 [iscsi_target_mod]
? chap_binaryhex_to_asciihex.constprop.5+0x50/0x50 [iscsi_target_mod]
? ftrace_caller_op_ptr+0xe/0xe
? __orc_find+0x6f/0xc0
? unwind_next_frame+0x231/0x850
? kthread+0x1a0/0x1c0
? ret_from_fork+0x35/0x40
? ret_from_fork+0x35/0x40
? iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod]
? deref_stack_reg+0xd0/0xd0
? iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod]
? is_module_text_address+0xa/0x11
? kernel_text_address+0x4c/0x110
? __save_stack_trace+0x82/0x100
? ret_from_fork+0x35/0x40
? save_stack+0x8c/0xb0
? 0xffffffffc1660000
? iscsi_target_do_login+0x155/0x8d0 [iscsi_target_mod]
? iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod]
? process_one_work+0x35c/0x640
? worker_thread+0x66/0x5d0
? kthread+0x1a0/0x1c0
? ret_from_fork+0x35/0x40
? iscsi_update_param_value+0x80/0x80 [iscsi_target_mod]
? iscsit_release_cmd+0x170/0x170 [iscsi_target_mod]
chap_main_loop+0x172/0x570 [iscsi_target_mod]
? chap_server_compute_md5.isra.2+0x860/0x860 [iscsi_target_mod]
? rx_data+0xd6/0x120 [iscsi_target_mod]
? iscsit_print_session_params+0xd0/0xd0 [iscsi_target_mod]
? cyc2ns_read_begin.part.2+0x90/0x90
? _raw_spin_lock_irqsave+0x25/0x50
? memcmp+0x45/0x70
iscsi_target_do_login+0x875/0x8d0 [iscsi_target_mod]
? iscsi_target_check_first_request.isra.5+0x1a0/0x1a0 [iscsi_target_mod]
? del_timer+0xe0/0xe0
? memset+0x1f/0x40
? flush_sigqueue+0x29/0xd0
iscsi_target_do_login_rx+0x3bc/0x4c0 [iscsi_target_mod]
? iscsi_target_nego_release+0x80/0x80 [iscsi_target_mod]
? iscsi_target_restore_sock_callbacks+0x130/0x130 [iscsi_target_mod]
process_one_work+0x35c/0x640
worker_thread+0x66/0x5d0
? flush_rcu_work+0x40/0x40
kthread+0x1a0/0x1c0
? kthread_bind+0x30/0x30
ret_from_fork+0x35/0x40

The buggy address belongs to the page:
page:ffffea0004243bc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x17fffc000000000()
raw: 017fffc000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: ffffea0004243c20 ffffea0004243ba0 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff8801090ef680: f2 f2 f2 f2 f2 f2 f2 01 f2 f2 f2 f2 f2 f2 f2 00
ffff8801090ef700: f2 f2 f2 f2 f2 f2 f2 00 02 f2 f2 f2 f2 f2 f2 00
>ffff8801090ef780: 00 f2 f2 f2 f2 f2 f2 00 00 f2 f2 f2 f2 f2 f2 00
                                              ^
ffff8801090ef800: 00 f2 f2 f2 f2 f2 f2 00 00 00 00 02 f2 f2 f2 f2
ffff8801090ef880: f2 f2 f2 00 00 00 00 00 00 00 00 f2 f2 f2 f2 00
==================================================================

Signed-off-by: Vincent Pelletier <plr.vincent@gmail.com>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 755e45f3155cc51e37dc1cce9ccde10b84df7d93)

Orabug: 29778875
CVE: CVE-2018-14633

Signed-off-by: John Donnelly <John.P.Donnelly@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: libsas: fix a race condition when smp task timeout

When the lldd is processing the complete sas task in interrupt and set the
task stat as SAS_TASK_STATE_DONE, the smp timeout timer is able to be
triggered at the same time. And smp_task_timedout() will complete the task
wheter the SAS_TASK_STATE_DONE is set or not. Then the sas task may freed
before lldd end the interrupt process. Thus a use-after-free will happen.

Fix this by calling the complete() only when SAS_TASK_STATE_DONE is not
set. And remove the check of the return value of the del_timer(). Once the
LLDD sets DONE, it must call task->done(), which will call
smp_task_done()->complete() and the task will be completed and freed
correctly.

Reported-by: chenxiang <chenxiang66@hisilicon.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
CC: John Garry <john.garry@huawei.com>
CC: Johannes Thumshirn <jthumshirn@suse.de>
CC: Ewan Milne <emilne@redhat.com>
CC: Christoph Hellwig <hch@lst.de>
CC: Tomas Henzl <thenzl@redhat.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: Hannes Reinecke <hare@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b90cd6f2b905905fb42671009dc0e27c310a16ae)

Orabug: 29783225
CVE: CVE-2018-20836

Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>