www.infradead.org Git - users/jedix/linux-maple.git/log

net/llc: avoid BUG_ON() in skb_orphan()

It seems nobody used LLC since linux-3.12.

Fortunately fuzzers like syzkaller still know how to run this code,
otherwise it would be no fun.

Setting skb->sk without skb->destructor leads to all kinds of
bugs, we now prefer to be very strict about it.

Ideally here we would use skb_set_owner() but this helper does not exist yet,
only CAN seems to have a private helper for that.

Orabug: 25802599
CVE: CVE-2017-6345

Fixes: 376c7311bdb6 ("net: add a temporary sanity check in skb_orphan()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com>
(cherry picked from commit 8b74d439e1697110c5e5c600643e823eb1dd0762)
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ip: fix IP_CHECKSUM handling

The skbs processed by ip_cmsg_recv() are not guaranteed to
be linear e.g. when sending UDP packets over loopback with
MSGMORE.
Using csum_partial() on [potentially] the whole skb len
is dangerous; instead be on the safe side and use skb_checksum().

Thanks to syzkaller team to detect the issue and provide the
reproducer.

v1 -> v2:
- move the variable declaration in a tighter scope

Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ca4ef4574f1ee5252e2cd365f8f5d5bafd048f32)

Orabug: 25802576
CVE-2017-6347

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

udp: fix IP_CHECKSUM handling

First bug was added in commit ad6f939ab193 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));

Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now UDP headers are pulled
before skb are put in receive queue.

Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv")
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Sam Kumar <samanthakumar@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 10df8e6152c6c400a563a673e9956320bfce1871)

Orabug: 25802576
CVE-2017-6347

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM

On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6dd pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.

Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checkum when reading truncated data.

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 31c2e4926fe912f88388bcaa8450fcaa8f2ece47)

Orabug: 25802576
CVE-2017-6347

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

tcp: avoid infinite loop in tcp_splice_read()

Splicing from TCP socket is vulnerable when a packet with URG flag is
received and stored into receive queue.

__tcp_splice_read() returns 0, and sk_wait_data() immediately
returns since there is the problematic skb in queue.

This is a nice way to burn cpu (aka infinite loop) and trigger
soft lockups.

Again, this gem was found by syzkaller tool.

Orabug: 25802549
CVE: CVE-2017-6214

Fixes: 9c55e01c0cc8 ("[TCP]: Splice receive support.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by : Aniket Alshi <aniket.alsi@oracle.com>
(cherry picked from commit ccf7abb93af09ad0868ae9033d1ca8108bdaec82)
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

sctp: avoid BUG_ON on sctp_wait_for_sndbuf

Alexander Popov reported that an application may trigger a BUG_ON in
sctp_wait_for_sndbuf if the socket tx buffer is full, a thread is
waiting on it to queue more data and meanwhile another thread peels off
the association being used by the first thread.

This patch replaces the BUG_ON call with a proper error handling. It
will return -EPIPE to the original sendmsg call, similarly to what would
have been done if the association wasn't found in the first place.

Acked-by: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2dcab598484185dea7ec22219c76dcdd59e3cb90)

Orabug: 25802515
CVE-2017-5986

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ext4: store checksum seed in superblock

Allow the filesystem to store the metadata checksum seed in the
superblock and add an incompat feature to say that we're using it.
This enables tune2fs to change the UUID on a mounted metadata_csum
FS without having to (racy!) rewrite all disk metadata.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 8c81bd8f586c46eaf114758a78d82895a2b081c2)

Orabug: 25802481
CVE-2016-10208

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
fs/ext4/sysfs.c

ext4: reserve code points for the project quota feature

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 8b4953e13f4c5d9a3c869f5fca7d51e1700e7db0)

Orabug: 25802481
CVE-2016-10208

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ext4: validate s_first_meta_bg at mount time

Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. And it turns out that kernel crashed when
calculating fs overhead (ext4_calculate_overhead()), this is because
the image has very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the memory in count_overhead() when
setting bitmap buffer, which is PAGE_SIZE.

ext4_calculate_overhead():
  buf = get_zeroed_page(GFP_NOFS);  <=== PAGE_SIZE buffer
  blks = count_overhead(sb, i, buf);

count_overhead():
  for (j = ext4_bg_num_gdb(sb, grp); j > 0; j--) { <=== j = 842150400
          ext4_set_bit(EXT4_B2C(sbi, s++), buf);   <=== buffer overrun
          count++;
  }

This can be reproduced easily for me by this script:

  #!/bin/bash
  rm -f fs.img
  mkdir -p /mnt/ext4
  fallocate -l 16M fs.img
  mke2fs -t ext4 -O bigalloc,meta_bg,^resize_inode -F fs.img
  debugfs -w -R "ssv first_meta_bg 842150400" fs.img
  mount -o loop fs.img /mnt/ext4

Fix it by validating s_first_meta_bg first at mount time, and
refusing to mount if its value exceeds the largest possible meta_bg
number.

Reported-by: Ralf Spenneberg <ralf@os-t.de>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
(cherry picked from commit 3a4b77cd47bb837b8557595ec7425f281f2ca1fe)

Orabug: 25802481
CVE-2016-10208

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ext4: clean up feature test macros with predicate functions

Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros. Furthermore, clean out the
places where we open-coded feature tests.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
(cherry picked from commit e2b911c53584a92266943f3b7f2cdbc19c1a4e80)

Orabug: 25802481
CVE-2016-10208

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
fs/ext4/namei.c
fs/ext4/super.c

KVM: x86: fix emulation of "MOV SS, null selector"

This is CVE-2017-2583. On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.

The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.

Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.

Orabug: 25802278
CVE: CVE-2017-2583

Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
Fixes: 79d5b4c3cd809c770d4bf9812635647016c56011
Cc: stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 33ab91103b3415e12457e3104f0e4517ce12d0f3)

Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>

gfs2: fix slab corruption during mounting and umounting gfs file system

During mounting and unmounting GFS2 file system, kernel panic happens
due to slab memory corruption. The slab allocator suggests that it is
likely a double free memory corrruption. The issue is traced back to
v3.9-rc6 where a patch is submitted to use kzalloc() for storing a
bitmap instead of using a local variable. The intention is to allocate
memory during mounting and to free memory during unmounting. The original
patch misses a code path which has already freed the memory and caused
memory corruption. This patch sets the memory pointer to NULL after
the memory is freed, so that double free memory corruption will not
be happened.

gdlm_mount()
  '-- set_recover_size() which use kzalloc()
  '-- if dlm does not support ops callbacks then
          '--- free_recover_size() which use kfree()

gldm_unmount()
  '-- free_recover_size() which use kfree()

previous patch which introduce the double free issue is
commit 57c7310b8eb9 ("GFS2: use kmalloc for lvb bitmap")

orabug: 25253085
orabug: 25791662

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>

gfs2: handle NULL rgd in set_rgrp_preferences

The function set_rgrp_preferences() does not handle the (rarely
returned) NULL value from gfs2_rgrpd_get_next() and this patch
fixes that.

The fs image in question is only 150MB in size which allows for
only 1 rgrp to be created. The in-memory rb tree has only 1 node
and when gfs2_rgrpd_get_next() is called on this sole rgrp, it
returns NULL. (Default behavior is to wrap around the rb tree and
return the first node to give the illusion of a circular linked
list. In the case of only 1 rgrp, we can't have
gfs2_rgrpd_get_next() return the same rgrp (first, last, next all
point to the same rgrp)... that would cause unintended consequences
and infinite loops.)

orabug: 25253085
Orabug: 25791662

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
(cherry picked from upstream commit 959b6717175713259664950f3bba2418b038f69a)
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>

Revert "fix minor infoleak in get_user_ex()"

Orabug: 25790370
CVE: CVE-2016-9644

This reverts commit fc2cba1c03dc9da0668d1719a6ad47b39c26b574.

sched/wait: Fix signal handling in bit wait helpers

Vladimir reported getting RCU stall warnings and bisected it back to
commit:

743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")

That commit inadvertently reversed the calls to schedule() and signal_pending(),
thereby not handling the case where the signal receives while we sleep.

Reported-by: Vladimir Murzin <vladimir.murzin@arm.com>
Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: mark.rutland@arm.com
Cc: neilb@suse.de
Cc: oleg@redhat.com
Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")
Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.")
Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 68985633bccb6066bf1803e316fbc6c1f5b796d6)

Orabug: 25416990
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-By: Dan Duval <dan.duval@oracle.com>
Conflicts:
kernel/sched/wait.c

xen-pcifront/hvm: Slurp up "pxm" entry and set NUMA node on PCIe device. (V5)

If the XenBus contains the "pci" (which by default
it does for both PV and HVM guests), then iterate over
all the entries there and see if there are any with "pxm-X"
key. If so those values are used to modify the NUMA locality
information for the PCIe devices that match.

Also support PCIe hotplug - in case this done during runtime.

This patch also depends on the Xen to expose via XenBus the
"pxm-%d" entries.

A bit of background:

_PXM in ACPI is used to tie all kind of ACPI devices to the SRAT
table.

The SRAT table is simple N CPU array that lists APIC IDs and the NUMA nodes
and their distance from each other. There are two types - processor
affinity and memory affinity. For example one can have on a 4 CPU
machine this processor affinity:

APIC_ID | NUMA id (nid)
--------+--------------
0       | 0
2       | 0
4       | 1
6       | 1

The _PXM tie in the NUMA (nid), so for this guest there can only be
two - 0 or 1.

The _PXM can be slapped on most anything in the DSDT, the Processors
(kind of redundant as it is in SRAT), but most importantly for us the
PCIe devices. Except that ACPI does not enumerate all kind of PCIe devices.

Device (PCI0)
        {
            Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */)  // _HID: Hardware ID
..
            Name (_PXM, Zero)  // _PXM: Device Proximity
        }

  Device (PCI1)
        {
            Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */)  // _HID: Hardware ID
            Name (_CID, EisaId ("PNP0A03") /* PCI Bus */)  // _CID: Compatible ID
            Name (_PXM, 0x01)  // _PXM: Device Proximity
        }

And this nicely helps with the Linux OS (and Windows) enumerating the
PCIe bridges (the two above) _ONCE_ during bootup. Then when a device
is hotplugged under the bridges it is very clear to which NUMA domain
it belongs.

To recap, on normal hardware Linux scans _ONCE_ the DSDT during
bootup and _only_ evaluates the _PXM on bridges "PNP0A03".

ONCE.

On the QEMU guests that Xen provides we have exactly _one_ bridge.
And the PCIe are hotplugged as 'slots' under it.

The SR-IOV VFs we hot-plug in the guest are done during runtime (not
during bootup, that would be too easy).

This means to make this work we would need to implement in QEMU:
1) Expand piix4 emulation to have bridges, PCIe bridges at bootup.
    And bridges also expose the "window" of what the size of the MMIO
    region is behind it (and the PCIe devices would fit in there).

2). Create up to NUMA node of these PCI bridges with the _PXM
    information.

3). Then during PCI hotplug would decide which bridge based on the
    NUMA locality.

That is hard. The 1) is especially difficult as we have no idea
how big MMIO bar the device plugged in will be!

Fortunatly Intel resolved this with the Intel VT-D. It has a hotplug
capability so you can insert a brand new PCIe bridge at any point.
This is how ThunderBolt works in essence.

This would mean that in QEMU we would need to:
4). Emulate in QEMU an IOMMU VT-d with PCI hotplug support.

Recognizing that 1-4 may take some time, and would need to be
done upstream first I decided to take a bit of shortcut.

Mainly that:
1) This only needs to work for ExaData which uses our kernel (UEK)
2) We already expose some of this information on XenBus.
3) Once upstream is done this can be easily 'dropped'.

In fact, 2) exposes a lot of information:

/libxl/1/device/pci/0 = ""
/libxl/1/device/pci/0/frontend = "/local/domain/1/device/pci/0"
/libxl/1/device/pci/0/backend = "/local/domain/0/backend/pci/1/0"
.. snip..
/libxl/1/device/pci/0/frontend-id = "1"
/libxl/1/device/pci/0/online = "1"
/libxl/1/device/pci/0/state = "1"
/libxl/1/device/pci/0/key-0 = "0000:04:00.0"
/libxl/1/device/pci/0/dev-0 = "0000:04:00.0"
/libxl/1/device/pci/0/vdevfn-0 = "28"
/libxl/1/device/pci/0/opts-0 = "msitranslate=0,power_mgmt=0,permissive=0"
/libxl/1/device/pci/0/state-0 = "1"
/libxl/1/device/pci/0/key-1 = "0000:07:00.0"
/libxl/1/device/pci/0/dev-1 = "0000:07:00.0"
/libxl/1/device/pci/0/vdevfn-1 = "30"
/libxl/1/device/pci/0/opts-1 = "msitranslate=0,power_mgmt=0,permissive=0"
/libxl/1/device/pci/0/state-1 = "1"
/libxl/1/device/pci/0/num_devs = "2"

The 'vdevfn' is the slot:function value. 28 is 00:05.0 and 30
is 00:06:0 and that corresponds to (inside of the guest):
-bash-4.1# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Class ff80: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 USB Controller: NEC Corporation Device 0194 (rev 03)
00:06.0 USB Controller: NEC Corporation Device 0194 (rev 04)

This 'vdevfn' is created by QEMU when the device is hotplugged
(or at bootup time).

So I figured it we have an extra key:

/libxl/1/device/pci/0/pxm-0 = "1"
/libxl/1/device/pci/0/pxm-1 = "1"

We could use that inside of the guest to call 'set_dev_node'
on the PCIe devices.

And in fact, for the above the lspci output we can make this work:

-bash-4.1# find /sys -name local_cpulist
/sys/devices/pci0000:00/0000:00:00.0/local_cpulist
/sys/devices/pci0000:00/0000:00:01.3/local_cpulist
/sys/devices/pci0000:00/0000:00:03.0/local_cpulist
/sys/devices/pci0000:00/0000:00:01.1/local_cpulist
/sys/devices/pci0000:00/0000:00:06.0/local_cpulist <===
/sys/devices/pci0000:00/0000:00:02.0/local_cpulist
/sys/devices/pci0000:00/0000:00:05.0/local_cpulist <===
/sys/devices/pci0000:00/0000:00:01.2/local_cpulist
/sys/devices/pci0000:00/0000:00:01.0/local_cpulist
-bash-4.1# find /sys -name local_cpulist | xargs cat
-3
0-3
0-3
0-3
2-3 <===
0-3
2-3 <===
0-3
0-3

-bash-4.1# find /sys -name cpulist
/sys/devices/system/node/node0/cpulist
/sys/devices/system/node/node1/cpulist
-bash-4.1# find /sys -name cpulist | xargs cat
0-1
2-3

With this guest config:
kernel = "hvmloader"
device_model_version = 'qemu-xen-traditional'
builder='hvm'
memory=1024
serial='pty'
smt=1
vcpus = 4
cpus=['0-3']
name="bootstrap-x86_64-pvhvm"
disk = [ 'file:/mnt/lab/bootstrap-x86_64/root_image.iso,hdc:cdrom,r','phy:/dev/guests/bootstrap-x86_64-pvhvm,xvda,w']
boot="dn"
vif = [ 'mac=00:0F:4B:00:00:68, bridge=switch' ]
vnc=1
vnclisten="0.0.0.0"
usb=1
usbdevice="tablet"
xen_platform_pci=1
vnuma=[ ["pnode=0", "size=512", "vcpus=0,1","vdistances=10,20"],
        ["pnode=0", "size=512", "vcpus=2,3","vdistances=20,10"]]

pci=['04:00.0','07:00.0']

And
During bootup you would see:
[   11.227821] calling  pcifront_init+0x0/0x118 @ 1
[   11.241826] pcifront_hvm pci-0: PCI device 0000:00:05.0 (PXM=1)
[   11.261298] pci 0000:00:05.0: Updating PXM to 1
[   11.273989] initcall pcifront_init+0x0/0x118 returned 0 after 32791 usecs
[   11.274620] pcifront_hvm pci-0: PCI device 0000:00:05.0 (PXM=1)
[   11.276977] pcifront_hvm pci-0: PCI device 0000:00:05.0 (PXM=1)

OraBug: 25788744

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
v1: Fixed all the checkpatch.pl issues
       Made it respect the node_online() in case the backend provided
       a larger value than there are NUMA nodes.
v2: Fixed per Boris's reviews.
v3: Added a mechanism to prune the list when devices are removed.
v4: s/l/len
    Added space after 'len' in decleration.
    Fixed comments
    Added Reviewed-by.
v5: Added Boris's Reviewed-by

IB/CORE: sync the resouce access in fmr_pool

orabug: 25677461

There were some problem in the fmr_pool code that either was missing lock
protection or was using wrong lock when allocating/freeing/looking up resource
in the FMR pool.

Covering all above issues, the code turns out that every where we need lock
protection we need both the pool_lock and used_pool_lock. So this patch also
removes the used_pool_lock and keeps the pool lock and make the later sync
all the accesses.

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>

net: ping: check minimum size on ICMP header length

Orabug: 25766884
CVE: CVE-2016-8399

Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there
was no check that the iovec contained enough bytes for an ICMP header,
and the read loop would walk across neighboring stack contents. Since the
iov_iter conversion, bad arguments are noticed, but the returned error is
EFAULT. Returning EINVAL is a clearer error and also solves the problem
prior to v3.19.

This was found using trinity with KASAN on v3.18:

BUG: KASAN: stack-out-of-bounds in memcpy_fromiovec+0x60/0x114 at addr ffffffc071077da0
Read of size 8 by task trinity-c2/9623
page:ffffffbe034b9a08 count:0 mapcount:0 mapping:          (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 0 PID: 9623 Comm: trinity-c2 Tainted: G    BU         3.18.0-dirty #15
Hardware name: Google Tegra210 Smaug Rev 1,3+ (DT)
Call trace:
[<ffffffc000209c98>] dump_backtrace+0x0/0x1ac arch/arm64/kernel/traps.c:90
[<ffffffc000209e54>] show_stack+0x10/0x1c arch/arm64/kernel/traps.c:171
[<     inline     >] __dump_stack lib/dump_stack.c:15
[<ffffffc000f18dc4>] dump_stack+0x7c/0xd0 lib/dump_stack.c:50
[<     inline     >] print_address_description mm/kasan/report.c:147
[<     inline     >] kasan_report_error mm/kasan/report.c:236
[<ffffffc000373dcc>] kasan_report+0x380/0x4b8 mm/kasan/report.c:259
[<     inline     >] check_memory_region mm/kasan/kasan.c:264
[<ffffffc00037352c>] __asan_load8+0x20/0x70 mm/kasan/kasan.c:507
[<ffffffc0005b9624>] memcpy_fromiovec+0x5c/0x114 lib/iovec.c:15
[<     inline     >] memcpy_from_msg include/linux/skbuff.h:2667
[<ffffffc000ddeba0>] ping_common_sendmsg+0x50/0x108 net/ipv4/ping.c:674
[<ffffffc000dded30>] ping_v4_sendmsg+0xd8/0x698 net/ipv4/ping.c:714
[<ffffffc000dc91dc>] inet_sendmsg+0xe0/0x12c net/ipv4/af_inet.c:749
[<     inline     >] __sock_sendmsg_nosec net/socket.c:624
[<     inline     >] __sock_sendmsg net/socket.c:632
[<ffffffc000cab61c>] sock_sendmsg+0x124/0x164 net/socket.c:643
[<     inline     >] SYSC_sendto net/socket.c:1797
[<ffffffc000cad270>] SyS_sendto+0x178/0x1d8 net/socket.c:1761

CVE-2016-8399

Reported-by: Qidan He <i@flanker017.me>
Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0eab121ef8750a5c8637d51534d5e9143fb0633f)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: sg: check length passed to SG_NEXT_CMD_LEN

Orabug: 25751395
CVE: CVE-2017-7187

The user can control the size of the next command passed along, but the
value passed to the ioctl isn't checked against the usable max command
size.

Cc: <stable@vger.kernel.org>
Signed-off-by: Peter Chang <dpf@google.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xen-netfront: Rework the fix for Rx stall during OOM and network stress

Orabug: 25747721

The commit 90c311b0eeea ("xen-netfront: Fix Rx stall during network
stress and OOM") caused the refill timer to be triggerred almost on
all invocations of xennet_alloc_rx_buffers for certain workloads.
This reworks the fix by reverting to the old behaviour and taking into
consideration the skb allocation failure. Refill timer is now triggered
on insufficient requests or skb allocation failure.

Signed-off-by: Vineeth Remanan Pillai <vineethp@amazon.com>
Fixes: 90c311b0eeea (xen-netfront: Fix Rx stall during network stress and OOM)
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport from upstream 538d92912d3190a1dd809233a0d57277459f37b2

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Acked-By: Joe Jin <joe.jin@oracle.com>

xen-netfront: Fix Rx stall during network stress and OOM

Orabug: 25747721

During an OOM scenario, request slots could not be created as skb
allocation fails. So the netback cannot pass in packets and netfront
wrongly assumes that there is no more work to be done and it disables
polling. This causes Rx to stall.

The issue is with the retry logic which schedules the timer if the
created slots are less than NET_RX_SLOTS_MIN. The count of new request
slots to be pushed are calculated as a difference between new req_prod
and rsp_cons which could be more than the actual slots, if there are
unconsumed responses.

The fix is to calculate the count of newly created slots as the
difference between new req_prod and old req_prod.

Signed-off-by: Vineeth Remanan Pillai <vineethp@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Backport from upstream 90c311b0eeead647b708a723dbdde1eda3dcad05

Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Acked-By: Joe Jin <joe.jin@oracle.com>

ipc/shm: Fix shmat mmap nil-page protection

The issue is described here, with a nice testcase:

    https://bugzilla.kernel.org/show_bug.cgi?id=192931

The problem is that shmat() calls do_mmap_pgoff() with MAP_FIXED, and
the address rounded down to 0.  For the regular mmap case, the
protection mentioned above is that the kernel gets to generate the
address -- arch_get_unmapped_area() will always check for MAP_FIXED and
return that address.  So by the time we do security_mmap_addr(0) things
get funky for shmat().

The testcase itself shows that while a regular user crashes, root will
not have a problem attaching a nil-page.  There are two possible fixes
to this.  The first, and which this patch does, is to simply allow root
to crash as well -- this is also regular mmap behavior, ie when hacking
up the testcase and adding mmap(...  |MAP_FIXED).  While this approach
is the safer option, the second alternative is to ignore SHM_RND if the
rounded address is 0, thus only having MAP_SHARED flags.  This makes the
behavior of shmat() identical to the mmap() case.  The downside of this
is obviously user visible, but does make sense in that it maintains
semantics after the round-down wrt 0 address and mmap.

Passes shm related ltp tests.

Link: http://lkml.kernel.org/r/1486050195-18629-1-git-send-email-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reported-by: Gareth Evans <gareth.evans@contextis.co.uk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 95e91b831f87ac8e1f8ed50c14d709089b4e01b8)

Orabug: 25717094
CVE-2017-5669

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

Merge branch 'topic/uek-4.1/dtrace' into uek/uek-next

* topic/uek-4.1/dtrace:
dtrace: improve io provider coverage

sg_write()/bsg_write() is not fit to be called under KERNEL_DS

Orabug: 25340071
CVE: CVE-2016-10088

Both damn things interpret userland pointers embedded into the payload;
worse, they are actually traversing those. Leaving aside the bad
API design, this is very much _not_ safe to call with KERNEL_DS.
Bail out early if that happens.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 128394eff343fc6d2f32172f03e24829539c5835)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

tcp: fix potential memory corruption

Imagine initial value of max_skb_frags is 17, and last
skb in write queue has 15 frags.

Then max_skb_frags is lowered to 14 or smaller value.

tcp_sendmsg() will then be allowed to add additional page frags
and eventually go past MAX_SKB_FRAGS, overflowing struct
skb_shared_info.

Orabug: 25140382

Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Cc: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ac9e70b17ecd7c6e933ff2eaf7ab37429e71bf4d)

Reviewed-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

block: fix use-after-free in seq file

Orabug: 25134541
CVE: CVE-2016-7910

I got a KASAN report of use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
    Read of size 8 by task trinity-c1/315
    =============================================================================
    BUG kmalloc-32 (Not tainted): kasan: bad access detected
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
            ___slab_alloc+0x4f1/0x520
            __slab_alloc.isra.58+0x56/0x80
            kmem_cache_alloc_trace+0x260/0x2a0
            disk_seqf_start+0x66/0x110
            traverse+0x176/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a
    INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
            __slab_free+0x17a/0x2c0
            kfree+0x20a/0x220
            disk_seqf_stop+0x42/0x50
            traverse+0x3b5/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a

    CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G    B           4.7.0+ #62
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
     ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
     ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
     ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
    Call Trace:
     [<ffffffff81d6ce81>] dump_stack+0x65/0x84
     [<ffffffff8146c7bd>] print_trailer+0x10d/0x1a0
     [<ffffffff814704ff>] object_err+0x2f/0x40
     [<ffffffff814754d1>] kasan_report_error+0x221/0x520
     [<ffffffff8147590e>] __asan_report_load8_noabort+0x3e/0x40
     [<ffffffff83888161>] klist_iter_exit+0x61/0x70
     [<ffffffff82404389>] class_dev_iter_exit+0x9/0x10
     [<ffffffff81d2e8ea>] disk_seqf_stop+0x3a/0x50
     [<ffffffff8151f812>] seq_read+0x4b2/0x11a0
     [<ffffffff815f8fdc>] proc_reg_read+0xbc/0x180
     [<ffffffff814b24e4>] do_loop_readv_writev+0x134/0x210
     [<ffffffff814b4c45>] do_readv_writev+0x565/0x660
     [<ffffffff814b8a17>] vfs_readv+0x67/0xa0
     [<ffffffff814b8de6>] do_preadv+0x126/0x170
     [<ffffffff814b92ec>] SyS_preadv+0xc/0x10

This problem can occur in the following situation:

open()
- pread()
    - .seq_start()
       - iter = kmalloc() // succeeds
       - seqf->private = iter
    - .seq_stop()
       - kfree(seqf->private)
- pread()
    - .seq_start()
       - iter = kmalloc() // fails
    - .seq_stop()
       - class_dev_iter_exit(seqf->private) // boom! old pointer

As the comment in disk_seqf_stop() says, stop is called even if start
failed, so we need to reinitialise the private pointer to NULL when seq
iteration stops.

An alternative would be to set the private pointer to NULL when the
kmalloc() in disk_seqf_start() fails.

Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 77da160530dd1dc94f6ae15a981f24e5f0021e84)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

xfs: Correctly lock inode when removing suid and file capabilities

From a6de82cab123beaf9406024943caa0242f0618b0 Mon Sep 17 00:00:00 2001

Currently XFS calls file_remove_privs() without holding i_mutex. This is
wrong because that function can end up messing with file permissions and
file capabilities stored in xattrs for which we need i_mutex held.

Fix the problem by grabbing iolock exclusively when we will need to
change anything in permissions / xattrs.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533
Signed-off-by: darrick.wong@oracle.com

fs: Call security_ops->inode_killpriv on truncate

From 45f147a1bc97c743c6101a8d2741c69a51f583e4 Mon Sep 17 00:00:00 2001

Comment in include/linux/security.h says that ->inode_killpriv() should
be called when setuid bit is being removed and that similar security
labels (in fact this applies only to file capabilities) should be
removed at this time as well. However we don't call ->inode_killpriv()
when we remove suid bit on truncate.

We fix the problem by calling ->inode_need_killpriv() and subsequently
->inode_killpriv() on truncate the same way as we do it on file write.

After this patch there's only one user of should_remove_suid() - ocfs2 -
and indeed it's buggy because it doesn't call ->inode_killpriv() on
write. However fixing it is difficult because of special locking
constraints.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533
Signed-off-by: darrick.wong@oracle.com

fs: Provide function telling whether file_remove_privs() will do anything

From dbfae0cdcd87602737101d4417811f4323156b54 Mon Sep 17 00:00:00 2001

Provide function telling whether file_remove_privs() will do anything.
Currently we only have should_remove_suid() and that does something
slightly different.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533
Signed-off-by: darrick.wong@oracle.com

fs: Rename file_remove_suid() to file_remove_privs()

From 5fa8e0a1c6a762857ae67d1628c58b9a02362003 Mon Sep 17 00:00:00 2001

file_remove_suid() is a misnomer since it removes also file capabilities
stored in xattrs and sets S_NOSEC flag. Also should_remove_suid() tells
something else than whether file_remove_suid() call is necessary which
leads to bugs.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533
Signed-off-by: darrick.wong@oracle.com

IB/uverbs: Fix leak of XRC target QPs

The real QP is destroyed in case of the ref count reaches zero, but
for XRC target QPs this call was missed and caused to QP leaks.

Let's call to destroy for all flows.

Orabug: 24761732

Fixes: 0e0ec7e0638e ('RDMA/core: Export ib_open_qp() to share XRC...')
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Noa Osherovich <noaos@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 5b810a242c28e1d8d64d718cebe75b79d86a0b2d)
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>

Some unsupported ioctls get logged unnecessarily

IPoIB logs messages such as "ib0: ioctl fail to copy request data".

Orabug: 24510137

Acked-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

IB/ipoib: Expose acl_enable sysfs file as read only

This file can be used to determine if ipoib supports IB-ACL.
In debug mode all sysfs files are exposed in full mode.
In non-debug mode only acl_enable is exposed but in real only mode.

Orabug: 25993951

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Mukesh Kacker <mukesh.kacker@oracle.com>

Merge branch 'topic/uek-4.1/dtrace' into uek/uek-next

* uek/uek-4.1-next-ndroux-v4.1.12-99.bug25816537#rc3:
dtrace: proc:::exit should trigger only if thread group exits

Merge branch 'topic/uek-4.1/dtrace' up to bug 26137220 into uek/uek-next

* topic/uek-4.1/dtrace:
  ctf: prevent modules on the dedup blacklist from sharing any types at all
  ctf: emit bitfields in in-memory order
  ctf: bitfield support
  ctf: emit file-scope static variables
  ctf: speed up the dwarf2ctf duplicate detector some more
  ctf: strdup() -> xstrdup()
  ctf: speed up the dwarf2ctf duplicate detector
  ctf: add module parameter to simple_dwfl_new() and adjust both callers
  ctf: fix the size of int and avoid duplicating it
  ctf: allow overriding of DIE attributes: use it for parent bias
  DTrace tcp/udp provider probes
  dtrace: define DTRACE_PROBE_ENABLED to 0 when !CONFIG_DTRACE
  dtrace: ensure limit is enforced even when pcs is NULL
  dtrace: make x86_64 FBT return probe detection less restrictive
  dtrace: support passing offset as arg0 to FBT return probes
  dtrace: make FBT entry probe detection less restrictive on x86_64
  dtrace: adjust FBT entry probe dection for OL7

ol7/config: enable nf_tables packet duplication support

Orabug: 24694570

For SPARC & X86_64
Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

netfilter: nf_dup: add missing dependencies with NF_CONNTRACK

CONFIG_NF_CONNTRACK=m
CONFIG_NF_DUP_IPV4=y

results in:

net/built-in.o: In function `nf_dup_ipv4':
>> (.text+0xd434f): undefined reference to `nf_conntrack_untracked'

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit d3340b79ec8222d20453b1e7f261b017d1d09dc9)

Orabug: 24694570

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

netfilter: nf_tables: add nft_dup expression

This new expression uses the nf_dup engine to clone packets to a given gateway.
Unlike xt_TEE, we use an index to indicate output interface which should be
fine at this stage.

Moreover, change to the preemtion-safe this_cpu_read(nf_skb_duplicated) from
nf_dup_ipv{4,6} to silence a lockdep splat.

Based on the original tee expression from Arturo Borrero Gonzalez, although
this patch has diverted quite a bit from this initial effort due to the
change to support maps.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit d877f07112f1e5a247c6b585c971a93895c9f738)

Orabug: 24694570

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

netfilter: factor out packet duplication for IPv4/IPv6

Extracted from the xtables TEE target. This creates two new modules for IPv4
and IPv6 that are shared between the TEE target and the new nf_tables dup
expressions.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit bbde9fc1824aab58bc78c084163007dd6c03fe5b)

Orabug: 24694570

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

netfilter: xt_TEE: get rid of WITH_CONNTRACK definition

Use IS_ENABLED(CONFIG_NF_CONNTRACK) instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 24b7811fa5de7bbbab3b782a38889034ea267f70)

Orabug: 24694570

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

netfilter: move tee_active to core

This prepares for a TEE like expression in nftables.
We want to ensure only one duplicate is sent, so both will
use the same percpu variable to detect duplication.

The other use case is detection of recursive call to xtables, but since
we don't want dependency from nft to xtables core its put into core.c
instead of the x_tables core.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit e7c8899f3e6f2830136cf6e115c4a55ce7a3920a)

Orabug: 24694570

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags

The neighbor look-up used to depend on the rt6i_gateway (if
there is a gateway) or the rt6i_dst (if it is a RTF_CACHE clone)
as the nexthop address. Note that rt6i_dst is set to fl6->daddr
for the RTF_CACHE clone where fl6->daddr is the one used to do
the route look-up.

Now, we only create RTF_CACHE clone after encountering exception.
When doing the neighbor look-up with a route that is neither a gateway
nor a RTF_CACHE clone, the daddr in skb will be used as the nexthop.

In some cases, the daddr in skb is not the one used to do
the route look-up. One example is in ip_vs_dr_xmit_v6() where the
real nexthop server address is different from the one in the skb.

This patch is going to follow the IPv4 approach and ask the
ip6_pol_route() callers to set the FLOWI_FLAG_KNOWN_NH properly.

In the next patch, ip6_pol_route() will honor the FLOWI_FLAG_KNOWN_NH
and create a RTF_CACHE clone.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Julian Anastasov <ja@ssi.bg>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 48e8aa6e3137692d38f20e8bfff100e408c6bc53)

Orabug: 24694570

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

ext4: Fix data exposure after failed AIO DIO

When AIO DIO fails e.g. due to IO error, we must not convert unwritten
extents as that will expose uninitialized data. Handle this case
by clearing unwritten flag from io_end in case of error and thus
preventing extent conversion.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry-picked commit from 74c66bcb7eda551f3b8588659c58fe29184af903)

Orabug: 24393811

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: fold xfs_vm_do_dio into xfs_vm_direct_IO

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry-picked commit from c19b104a67b3bb1ac48275a8a1c9df666e676c25)

Orabug: 24393811

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: don't use ioends for direct write completions

We only need to communicate two bits of information to the direct I/O
completion handler:

(1) do we need to convert any unwritten extents in the range
(2) do we need to check if we need to update the inode size based
     on the range passed to the completion handler

We can use the private data passed to the get_block handler and the
completion handler as a simple bitmask to communicate this information
instead of the current complicated infrastructure reusing the ioends
from the buffer I/O path, and thus avoiding a memory allocation and
a context switch for any non-trivial direct write.  As a nice side
effect we also decouple the direct I/O path implementation from that
of the buffered I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
(cherry-picked commit from 273dda76f757108bc2b29d30a9595b6dd3bdf3a1)

Orabug: 24393811
Conflicts:
     Fixed a merge conflict in xfs_trace.h arised due to absence of
xfs_zero_eof in UEK4.

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

direct-io: always call ->end_io if non-NULL

This way we can pass back errors to the file system, and allow for
cleanup required for all direct I/O invocations.

Also allow the ->end_io handlers to return errors on their own, so that
I/O completion errors can be passed on to the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry-picked commit from 187372a3b9faff68ed61c291d0135e6739e0dbdf)

Orabug: 24393811
Conflicts:
The change in the return type of the function dio_iodone_t
broke the KABI. Hence, the original function return type is wrapped
around under the flag __GENKSYMS__

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Btrfs: send, fix failure to rename top level inode due to name collision

Orabug: 25994280

Under certain situations, an incremental send operation can fail due to a
premature attempt to create a new top level inode (a direct child of the
subvolume/snapshot root) whose name collides with another inode that was
removed from the send snapshot.

Consider the following example scenario.

Parent snapshot:

  .                 (ino 256, gen 8)
  |---- a1/         (ino 257, gen 9)
  |---- a2/         (ino 258, gen 9)

Send snapshot:

  .                 (ino 256, gen 3)
  |---- a2/         (ino 257, gen 7)

In this scenario, when receiving the incremental send stream, the btrfs
receive command fails like this (ran in verbose mode, -vv argument):

  rmdir a1
  mkfile o257-7-0
  rename o257-7-0 -> a2
  ERROR: rename o257-7-0 -> a2 failed: Is a directory

What happens when computing the incremental send stream is:

1) An operation to remove the directory with inode number 257 and
   generation 9 is issued.

2) An operation to create the inode with number 257 and generation 7 is
   issued. This creates the inode with an orphanized name of "o257-7-0".

3) An operation rename the new inode 257 to its final name, "a2", is
   issued. This is incorrect because inode 258, which has the same name
   and it's a child of the same parent (root inode 256), was not yet
   processed and therefore no rmdir operation for it was yet issued.
   The rename operation is issued because we fail to detect that the
   name of the new inode 257 collides with inode 258, because their
   parent, a subvolume/snapshot root (inode 256) has a different
   generation in both snapshots.

So fix this by ignoring the generation value of a parent directory that
matches a root inode (number 256) when we are checking if the name of the
inode currently being processed collides with the name of some other
inode that was not yet processed.

We can achieve this scenario of different inodes with the same number but
different generation values either by mounting a filesystem with the inode
cache option (-o inode_cache) or by creating and sending snapshots across
different filesystems, like in the following example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt
  $ mkdir /mnt/a1
  $ mkdir /mnt/a2
  $ btrfs subvolume snapshot -r /mnt /mnt/snap1
  $ btrfs send /mnt/snap1 -f /tmp/1.snap
  $ umount /mnt

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ touch /mnt/a2
  $ btrfs subvolume snapshot -r /mnt /mnt/snap2
  $ btrfs receive /mnt -f /tmp/1.snap
  # Take note that once the filesystem is created, its current
  # generation has value 7 so the inode from the second snapshot has
  # a generation value of 7. And after receiving the first snapshot
  # the filesystem is at a generation value of 10, because the call to
  # create the second snapshot bumps the generation to 8 (the snapshot
  # creation ioctl does a transaction commit), the receive command calls
  # the snapshot creation ioctl to create the first snapshot, which bumps
  # the filesystem's generation to 9, and finally when the receive
  # operation finishes it calls an ioctl to transition the first snapshot
  # (snap1) from RW mode to RO mode, which does another transaction commit
  # and bumps the filesystem's generation to 10.
  $ rm -f /tmp/1.snap
  $ btrfs send /mnt/snap1 -f /tmp/1.snap
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
  $ umount /mnt

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt
  $ btrfs receive /mnt /tmp/1.snap
  # Receive of snapshot snap2 used to fail.
  $ btrfs receive /mnt /tmp/2.snap

Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 4dd9920d991745c4a16f53a8f615f706fbe4b3f7)
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>

PCI: Check pref compatible bit for mem64 resource of PCIe device

We still get "no compatible bridge window" warning on sparc T5-8
after we add support for 64bit resource parsing for root bus.

PCI: scan_bus[/pci@300/pci@1/pci@0/pci@6] bus no 8
PCI: Claiming 0000:00:01.0: Resource 15: 0000800100000000..00008004afffffff [220c]
PCI: Claiming 0000:01:00.0: Resource 15: 0000800100000000..00008004afffffff [220c]
PCI: Claiming 0000:02:04.0: Resource 15: 0000800100000000..000080012fffffff [220c]
PCI: Claiming 0000:03:00.0: Resource 15: 0000800100000000..000080012fffffff [220c]
PCI: Claiming 0000:04:06.0: Resource 14: 0000800100000000..000080010fffffff [220c]
PCI: Claiming 0000:05:00.0: Resource 0: 0000800100000000..0000800100001fff [204]
pci 0000:05:00.0: can't claim BAR 0 [mem 0x800100000000-0x800100001fff]: no compatible bridge window

All the bridges 64-bit resource have pref bit, but the device resource does not
have pref set, then we can not find parent for the device resource,
as we can not put non-pref mmio under pref mmio.

According to pcie spec errta
https://www.pcisig.com/specifications/pciexpress/base2/PCIe_Base_r2.1_Errata_08Jun10.pdf
page 13, in some case it is ok to mark some as pref.

Mark if the entire path from the host to the adapter is over PCI Express.
Set pref compatible bit for claim/sizing/assign for 64bit mem resource
on that pcie device.

-v2: set pref for mmio 64 when whole path is PCI Express, according to David Miller.
-v3: don't set pref directly, change to UNDER_PREF, and set PREF before
sizing and assign resource, and cleart PREF afterwards. requested by BenH.
-v4: use on_all_pcie_path device flag instead.
-v6: update after pci_find_bus_resource() change

Link: http://lkml.kernel.org/r/CAE9FiQU1gJY1LYrxs+ma5LCTEEe4xmtjRG0aXJ9K_Tsu+m9Wuw@mail.gmail.com
Reported-by: David Ahern <david.ahern@oracle.com>
Tested-by: David Ahern <david.ahern@oracle.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=81431
Tested-by: TJ <linux@iam.tj>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: sparclinux@vger.kernel.org
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 44e098ad7113adb109fa2d95a29fe2ba9a846efd)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

OF/PCI: Add IORESOURCE_MEM_64 for 64-bit resource

For device resource PREF bit setting under bridge 64-bit pref resource,
we need to make sure only set PREF for 64bit resource.

This patch set IORESOUCE_MEM_64 for 64bit resource during OF device resource
flags parsing.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=96261
Link: https://bugzilla.kernel.org/show_bug.cgi?id=96241
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: devicetree@vger.kernel.org
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 174fdc115fe527cdcf6b4b99f1e8e6ca4f8062ce)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc/PCI: Keep resource idx order with bridge register number

On one system found strange "no compatible bridge window" warning
even we already had pref_compat support that add extra pref bit for device
resource.

PCI: Claiming 0000:00:01.0: Resource 14: 0002000100000000..000200010fffffff [10220c]
PCI: Claiming 0000:01:00.0: Resource 1: 0002000100000000..000200010000ffff [100214]
pci 0000:01:00.0: can't claim BAR 1 [mem 0x2000100000000-0x200010000ffff 64bit]: no compatible bridge window

It turns out that pci_resource_compatible()/pci_up_path_over_pref_mem64()
just check resource with bridge pref mmio register idx 15, and we have put
resource to use mmio register idx 14 during of_scan_pci_bridge()
as the bridge does not have mmio resource.

We already fix pci_up_path_over_pref_mem64() to check all bus resources.

And at the same time, this patch make resource to have consistent sequence
like other arch or directly from pci_read_bridge_bases(),
even when non-pref mmio is missing, or out of ordering in firmware reporting.

Just hold i = 1 for non pref mmio, and i = 2 for pref mmio.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: sparclinux@vger.kernel.org
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 71c1871c22891102da6303d09b46184361d5f853)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc/PCI: Add IORESOURCE_MEM_64 for 64-bit resource in OF parsing

For device resource with PREF bit setting under bridge 64-bit pref resource,
we need to make sure only set PREF for 64bit resource.

so this patch set IORESOUCE_MEM_64 for 64bit resource during OF device resource
flags parsing.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=96261
Link: https://bugzilla.kernel.org/show_bug.cgi?id=96241
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: sparclinux@vger.kernel.org
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 47b22604dceb59dfc1018ebd0b8def065daa27db)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc/PCI: Reserve legacy mmio after PCI mmio

On one system found bunch of claim resource fail from pci device.
pci_sun4v f02b894c: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io 0x2007e00000000-0x2007e0fffffff] (bus address [0x0000-0xfffffff])
pci_bus 0000:00: root bus resource [mem 0x2000000000000-0x200007effffff] (bus address [0x00000000-0x7effffff])
pci_bus 0000:00: root bus resource [mem 0x2000100000000-0x20007ffffffff] (bus address [0x100000000-0x7ffffffff])
...
PCI: Claiming 0000:00:02.0: Resource 14: 0002000000000000..00020000004fffff [200]
pci 0000:00:02.0: can't claim BAR 14 [mem 0x2000000000000-0x20000004fffff]: address conflict with Video RAM area [??? 0x20000000a0000-0x20000000bffff flags 0x80000000]
pci 0000:02:00.0: can't claim BAR 0 [mem 0x2000000000000-0x20000000fffff]: no compatible bridge window
PCI: Claiming 0000:02:00.0: Resource 3: 0002000000100000..0002000000103fff [200]
pci 0000:02:00.0: can't claim BAR 3 [mem 0x2000000100000-0x2000000103fff]: no compatible bridge window
PCI: Claiming 0000:02:00.1: Resource 0: 0002000000200000..00020000002fffff [200]
pci 0000:02:00.1: can't claim BAR 0 [mem 0x2000000200000-0x20000002fffff]: no compatible bridge window
PCI: Claiming 0000:02:00.1: Resource 3: 0002000000104000..0002000000107fff [200]
pci 0000:02:00.1: can't claim BAR 3 [mem 0x2000000104000-0x2000000107fff]: no compatible bridge window
PCI: Claiming 0000:02:00.2: Resource 0: 0002000000300000..00020000003fffff [200]
pci 0000:02:00.2: can't claim BAR 0 [mem 0x2000000300000-0x20000003fffff]: no compatible bridge window
PCI: Claiming 0000:02:00.2: Resource 3: 0002000000108000..000200000010bfff [200]
pci 0000:02:00.2: can't claim BAR 3 [mem 0x2000000108000-0x200000010bfff]: no compatible bridge window
PCI: Claiming 0000:02:00.3: Resource 0: 0002000000400000..00020000004fffff [200]
pci 0000:02:00.3: can't claim BAR 0 [mem 0x2000000400000-0x20000004fffff]: no compatible bridge window
PCI: Claiming 0000:02:00.3: Resource 3: 000200000010c000..000200000010ffff [200]
pci 0000:02:00.3: can't claim BAR 3 [mem 0x200000010c000-0x200000010ffff]: no compatible bridge window

The bridge 00:02.0 resource does not get reserved as Video RAM take the position early,
and following children resources reservation all fail.

Move down Video RAM area reservation after pci mmio get reserved,
so we leave pci driver to use those regions.

-v5: merge simplify one and use pcibios_bus_to_resource()

-v6: use pci_find_bus_resource()

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: sparclinux@vger.kernel.org
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit bb2c8f32be84cbeec0dd585481c637657e73b05e)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Add pci_find_bus_resource()

Add pci_find_bus_resource() to return bus resource for input resource.

In some case, we may only have bus instead of dev.
It is same as pci_find_parent_resource, but take bus as input.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 7f93fc77f440039967ee5a057d49644526a7879f)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc/PCI: Use correct offset for bus address to resource

After we added 64bit mmio parsing, we got some "no compatible bridge window"
warning on anther new model that support 64bit resource.

It turns out that we can not use mem_space.start as 64bit mem space
offset, aka there is mem_space.start != offset.

Use child_phys_addr to calculate exact offset and record offset in
pbm.

After patch we get correct offset.

/pci@305: PCI IO [io  0x2007e00000000-0x2007e0fffffff] offset 2007e00000000
/pci@305: PCI MEM [mem 0x2000000100000-0x200007effffff] offset 2000000000000
/pci@305: PCI MEM64 [mem 0x2000100000000-0x2000dffffffff] offset 2000000000000
...
pci_sun4v f02ae7f8: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0x2007e00000000-0x2007e0fffffff] (bus address [0x0000-0xfffffff])
pci_bus 0000:00: root bus resource [mem 0x2000000100000-0x200007effffff] (bus address [0x00100000-0x7effffff])
pci_bus 0000:00: root bus resource [mem 0x2000100000000-0x2000dffffffff] (bus address [0x100000000-0xdffffffff])

-v3: put back mem64_offset, as we found T4 has mem_offset != mem64_offset
     check overlapping between mem64_space and mem_space.
-v7: after new pci_mmap_page_range patches.
-v8: remove change in pci_resource_to_user()

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: sparclinux@vger.kernel.org
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit f296714da83b75783997f8dcfe2a9021ef8fedde)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Remove __pci_mmap_make_offset()

After
PCI: Let pci_mmap_page_range() take resource address
No user for __pci_mmap_make_offset in those arch.

Remove them.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit c34a8c61ea58d516fb20c6ec4fdf338f96fcfeef)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Let pci_mmap_page_range() take resource address

Original pci_mmap_page_range() is taking PCI BAR value aka usr_address.

Bjorn found out that it would be much simple to pass resource address
directly and avoid extra those __pci_mmap_make_offset.

In this patch:
1. in proc path: proc_bus_pci_mmap, try convert back to resource
   before calling pci_mmap_page_range
2. in sysfs path: pci_mmap_resource will just offset with resource start.
3. all pci_mmap_page_range will have vma->vm_pgoff with in resource
   range instead of BAR value.
4. skip calling __pci_mmap_make_offset, as the checking is done
   in pci_mmap_fits().

-v2: add pci_user_to_resource and remove __pci_mmap_make_offset
-v3: pass resource pointer with pci_mmap_page_range()
-v4: put __pci_mmap_make_offset() removing to following patch
     seperate /sys io access alignment checking to another patch
     updated after Bjorn's pci_resource_to_user() changes.
-v5: update after fix for pci_mmap with proc path accoring to
     Bjorn.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit d6ccd78899c966eec1bac726f96f6cbed5b9960e)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Fix proc mmap on sparc

In 8c05cd08a7 ("PCI: fix offset check for sysfs mmapped files"), try
to check exposed value with resource start/end in proc mmap path.

|        start = vma->vm_pgoff;
|        size = ((pci_resource_len(pdev, resno) - 1) >> PAGE_SHIFT) + 1;
|        pci_start = (mmap_api == PCI_MMAP_PROCFS) ?
|                        pci_resource_start(pdev, resno) >> PAGE_SHIFT : 0;
|        if (start >= pci_start && start < pci_start + size &&
|                        start + nr <= pci_start + size)

vma->vm_pgoff is above code segment is user/BAR value >> PAGE_SHIFT.
pci_start is resource->start >> PAGE_SHIFT.

For sparc, resource start is different from BAR start aka pci bus address.
pci bus address need to add offset to be the resource start.

So that commit breaks all arch that exposed value is BAR/user value,
and need to be offseted to resource address.

test code using: ./test_mmap_proc /proc/bus/pci/0000:00/04.0 0x2000000
test code segment:
  fd = open(argv[1], O_RDONLY);
  ...
  sscanf(argv[2], "0x%lx", &offset);
  left = offset & (PAGE_SIZE - 1);
  offset &= PAGE_MASK;
  ioctl(fd, PCIIOC_MMAP_IS_MEM);
  addr = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, offset);
  for (i = 0; i < 8; i++)
    printf("%x ", addr[i + left]);
  munmap(addr, PAGE_SIZE);
  close(fd);

Fixes: 8c05cd08a7 ("PCI: fix offset check for sysfs mmapped files")
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 7bd8ad7855e6ecdac36c6dc7946f5b6beff13ae2)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Supply CPU physical address (not bus address) to iomem_is_exclusive()

iomem_is_exclusive() requires a CPU physical address, but on some arches we
supplied a PCI bus address instead.

On most arches, pci_resource_to_user(res) returns "res->start", which is a
CPU physical address. But on microblaze, mips, powerpc, and sparc, it
returns the PCI bus address corresponding to "res->start".

The result is that pci_mmap_resource() may fail when it shouldn't (if the
bus address happens to match an existing resource), or it may succeed when
it should fail (if the resource is exclusive but the bus address doesn't
match it).

Call iomem_is_exclusive() with "res->start", which is always a CPU physical
address, not the result of pci_resource_to_user().

Fixes: e8de1481fd71 ("resource: allow MMIO exclusivity for device drivers")
Suggested-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: Arjan van de Ven <arjan@linux.intel.com>
Orabug: 22855133
(cherry picked from commit ca620723d4ff9ea7ed484eab46264c3af871b9ae)
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit aee594b88bfcb58f51e0070de06d3879b3ea4609)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Use correct bus address to resource offset"

This reverts commit dae731c1974171f6c85a7efe018625b72edfa82a. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 2dc2632b0b252f8f50c276efa9cc4f00c8e067b5)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Unify pci_register_region()"

This reverts commit 51c0ecdf02fec6728bb1ecc674ee1b7fe90c9d69. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 934fffe0fdf1bf81c90be604827f19c4ede839fb)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Reserve legacy mmio after PCI mmio"

This reverts commit 0c1736c25a4e157cc31f4bed645fded815b4db97. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 59860a19cf9b2acd24a7eaec6d539dbc984a90b9)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Add IORESOURCE_MEM_64 for 64-bit resource in OF parsing"

This reverts commit 9594d268aa40997f3a47e5a8c64463f32db226ff. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit b2fd930d3e95db05979aefc42291be1f6ba2c625)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc/PCI: Keep resource idx order with bridge register number"

This reverts commit ed426d2bf0fa5244e6cfd531e25c50098df6b3cf. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 09dd990a13d7ea2a257f5be189f109fa80259599)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: kill wrong quirk about M7101"

This reverts commit 2bd1643612ce5560c80856f369ae81fc80e0331e. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit de113e58a8cc67f7daa128c33ac49a33935dc767)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "OF/PCI: Add IORESOURCE_MEM_64 for 64-bit resource"

This reverts commit 59072574902b6e44574c34a9989ec6ec6c494f99. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit c0691adf4a28993c77371969ff7800901ff1a67b)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Check pref compatible bit for mem64 resource of PCIe device"

This reverts commit 40990901003330e03db71f9f7b8966a6b5c6e5ba. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit bb9160790d346bcbfc0b5ca701e7e6f6d5d56f87)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Only treat non-pref mmio64 as pref if all bridges have MEM_64"

This reverts commit 18d135012bbc70a30df2a664c4470ce94430c9ea. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 2998a347e93c41d8677e73510a139aa54339c88f)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Add has_mem64 for struct host_bridge"

This reverts commit 9bb848b4f153ca43e05ad976274c8281240d5adc. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 4e49b00da550d255a9470d4ffcffb6dea2227570)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Only treat non-pref mmio64 as pref if host bridge has mmio64"

This reverts commit d39428713e56e17b1558a1b69828e514b29b18e5. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 0c060ada1d50dd5f423eed0e2ba4f3e15c4579b0)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "PCI: Restore pref MMIO allocation logic for host bridge without mmio64"

This reverts commit 88f97208c7a78187924924f741f864d6a3ccc62e. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 5b6a6cf1e2785b6bf9153c726e09435c05921040)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Revert "sparc: Accommodate mem64_offset != mem_offset in pbm configuration"

This reverts commit 6037ec961e10e6e162c564d5306b2c2fddfb5b77. This
commit causes hotplug to break as documented in Orabug 22855133.

Orabug: 22855133

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 7d551cd315fb544b9b815566737de6e146fd0cd8)
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Prevent VPD access for QLogic ISP2722

QLogic ISP2722-based 16/32Gb Fibre Channel to PCIe Adapter has the VPD
access issue too, while read the common pci-sysfs access interface shown as

/sys/devices/pci0000:00/0000:00:03.2/0000:0b:00.0/vpd

with simple 'cat' could cause system hang and panic:

  Kernel panic - not syncing: An NMI occurred. Depending on your system the reason for the NMI is logged in any one of the following resources:
  1. Integrated Management Log (IML)
  2. OA Syslog
  3. OA Forward Progress Log
  4. iLO Event Log
  CPU: 0 PID: 15070 Comm: udevadm Not tainted 4.1.12
  Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015
   0000000000000086 000000007f0cdf51 ffff880c4fa05d58 ffffffff817193de
   ffffffffa00b42d8 0000000000000075 ffff880c4fa05dd8 ffffffff81714072
   0000000000000008 ffff880c4fa05de8 ffff880c4fa05d88 000000007f0cdf51
  Call Trace:
   <NMI>  [<ffffffff817193de>] dump_stack+0x63/0x81
   [<ffffffff81714072>] panic+0xd0/0x20e
   [<ffffffffa00b390d>] hpwdt_pretimeout+0xdd/0xe0 [hpwdt]
   [<ffffffff81021fc9>] ? sched_clock+0x9/0x10
   [<ffffffff8101c101>] nmi_handle+0x91/0x170
   [<ffffffff8101c10c>] ? nmi_handle+0x9c/0x170
   [<ffffffff8101c5fe>] io_check_error+0x1e/0xa0
   [<ffffffff8101c719>] default_do_nmi+0x99/0x140
   [<ffffffff8101c8b4>] do_nmi+0xf4/0x170
   [<ffffffff817232c5>] end_repeat_nmi+0x1a/0x1e
   [<ffffffff815d724b>] ? pci_conf1_read+0xeb/0x120
   [<ffffffff815d724b>] ? pci_conf1_read+0xeb/0x120
   [<ffffffff815d724b>] ? pci_conf1_read+0xeb/0x120
   <<EOE>>  [<ffffffff815db4b3>] raw_pci_read+0x23/0x40
   [<ffffffff815db4fc>] pci_read+0x2c/0x30
   [<ffffffff8136f612>] pci_user_read_config_word+0x72/0x110
   [<ffffffff8136f746>] pci_vpd_pci22_wait+0x96/0x130
   [<ffffffff8136ff9b>] pci_vpd_pci22_read+0xdb/0x1a0
   [<ffffffff8136ea30>] pci_read_vpd+0x20/0x30
   [<ffffffff8137d590>] read_vpd_attr+0x30/0x40
   [<ffffffff8128e037>] sysfs_kf_bin_read+0x47/0x70
   [<ffffffff8128d24e>] kernfs_fop_read+0xae/0x180
   [<ffffffff8120dd97>] __vfs_read+0x37/0x100
   [<ffffffff812ba7e4>] ? security_file_permission+0x84/0xa0
   [<ffffffff8120e366>] ? rw_verify_area+0x56/0xe0
   [<ffffffff8120e476>] vfs_read+0x86/0x140
   [<ffffffff8120f3f5>] SyS_read+0x55/0xd0
   [<ffffffff81720f2e>] system_call_fastpath+0x12/0x71
  Shutting down cpus with NMI
  Kernel Offset: disabled
  drm_kms_helper: panic occurred, switching back to text console

So blacklist the access to its VPD.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: stable@vger.kernel.org # v4.6+
(cherry picked from commit 0d5370d1d85251e5893ab7c90a429464de2e140b)

Orabug: 25975482

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

PCI: Prevent VPD access for buggy devices

On some devices, reading or writing VPD causes a system panic.
This can be easily reproduced by running "lspci -vvv" or
"cat /sys/bus/devices/XX../vpd".

Blacklist these devices so we don't access VPD data at all.

[bhelgaas: changelog, comment, drop pci/access.c changes]
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110681
Tested-by: Shane Seymour <shane.seymour@hpe.com>
Tested-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
(cherry picked from commit 7c20078a8197389eead62399419fdc4f8ac4a8a3)

Orabug: 25975482

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Conflicts:
drivers/pci/quirks.c

target: consolidate backend attribute implementations

Orabug: 25791789

Provide a common sets of dev_attrib attributes for all devices using the
generic SPC/SBC parsers, and a second one with the minimal required read-only
attributes for passthrough devices. The later is only used by pscsi for now,
but will be wired up for the full-passthrough TCMU use case as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
(cherry picked from commit 5873c4d157400ade4052e9d7b6259fa592e1ddbf)
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>

target: simplify backend driver registration

Orabug: 25791789

Rewrite the backend driver registration based on what we did to the fabric
drivers:  introduce a read-only struct target_bakckend_ops that the driver
registers, which is then instanciate as a struct target_backend by the
core.  This allows the ops vector to be smaller and allows us to mark it
const.  At the same time the registration function can set up the
configfs attributes, avoiding the need to add additional boilerplate code
for that to the drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
(cherry picked from commit 0a06d4309dc168dfa70cec3cf0cd9eb7fc15a2fd)
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>

x86/tsc: Enumerate SKL cpu_khz and tsc_khz via CPUID

Skylake CPU base-frequency and TSC frequency may differ
by up to 2%.

Enumerate CPU and TSC frequencies separately, allowing
cpu_khz and tsc_khz to differ.

The existing CPU frequency calibration mechanism is unchanged.
However, CPUID extensions are preferred, when available.

CPUID.0x16 is preferred over MSR and timer calibration
for CPU frequency discovery.

CPUID.0x15 takes precedence over CPU-frequency
for TSC frequency discovery.

Signed-off-by: Len Brown <len.brown@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/b27ec289fd005833b27d694d9c2dbb716c5cdff7.1466138954.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit aa297292d708e89773b3b2cdcaf33f01bfa095d8)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>

x86/tsc: Save an indentation level in recalibrate_cpu_khz()

... by flipping the check.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1459837795-2588-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit eff4677e9fb9b680d1d5f6ba079116548d072b7e)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Conflicts:
arch/x86/kernel/tsc.c

x86/tsc_msr: Remove irqoff around MSR-based TSC enumeration

Remove the irqoff/irqon around MSR-based TSC enumeration,
as it is not necessary.

Also rename: try_msr_calibrate_tsc() to cpu_khz_from_msr(),
as that better describes what the routine does.

Signed-off-by: Len Brown <len.brown@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/a6b5c3ecd3b068175d2309599ab28163fc34215e.1466138954.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 02c0cd2dcf7fdc47d054b855b148ea8b82dbb7eb)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>

perf/x86: Fix time_shift in perf_event_mmap_page

Commit:

b20112edeadf ("perf/x86: Improve accuracy of perf/sched clock")

allowed the time_shift value in perf_event_mmap_page to be as much
as 32. Unfortunately the documented algorithms for using time_shift
have it shifting an integer, whereas to work correctly with the value
32, the type must be u64.

In the case of perf tools, Intel PT decodes correctly but the timestamps
that are output (for example by perf script) have lost 32-bits of
granularity so they look like they are not changing at all.

Fix by limiting the shift to 31 and adjusting the multiplier accordingly.

Also update the documentation of perf_event_mmap_page so that new code
based on it will be more future-proof.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: b20112edeadf ("perf/x86: Improve accuracy of perf/sched clock")
Link: http://lkml.kernel.org/r/1445001845-13688-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b9511cd761faafca7a1acc059e792c1399f9d7c6)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>

perf/x86: Improve accuracy of perf/sched clock

When TSC is stable perf/sched clock is based on it.
However the conversion from cycles to nanoseconds
is not as accurate as it could be. Because
CYC2NS_SCALE_FACTOR is 10, the accuracy is +/- 1/2048

The change is to calculate the maximum shift that
results in a multiplier that is still a 32-bit number.
For example all frequencies over 1 GHz will have
a shift of 32, making the accuracy of the conversion
+/- 1/(2^33). That is achieved by using the
'clocks_calc_mult_shift()' function.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1440147918-22250-1-git-send-email-adrian.hunter@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b20112edeadf0b8a1416de061caa4beb11539902)

Orabug: 25948913

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>

x86/apic: Handle zero vector gracefully in clear_vector_irq()

If x86_vector_alloc_irq() fails x86_vector_free_irqs() is invoked to cleanup
the already allocated vectors. This subsequently calls clear_vector_irq().

The failed irq has no vector assigned, which triggers the BUG_ON(!vector) in
clear_vector_irq().

We cannot suppress the call to x86_vector_free_irqs() for the failed
interrupt, because the other data related to this irq must be cleaned up as
well. So calling clear_vector_irq() with vector == 0 is legitimate.

Remove the BUG_ON and return if vector is zero,

[ tglx: Massaged changelog ]

Fixes: b5dc8e6c21e7 "x86/irq: Use hierarchical irqdomain to manage CPU interrupt vectors"
Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 1bdb8970392a68489b469c3a330a1adb5ef61beb)

Orabug: 24515998,25975565

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
arch/x86/kernel/apic/vector.c

dtrace: improve io provider coverage

The DTrace io provider coverage is extended to include IO performed
through the Generic Block Layer (bio). It also adds io provider probes
to XFS.

Orabug: 25816537

Signed-off-by: Nicolas Droux <nicolas.droux@oracle.com>
Acked-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>

dtrace: proc:::exit should trigger only if thread group exits

Orabug: 25904298

Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>
Acked-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>

HID: hid-cypress: validate length of report

Make sure we have enough of a report structure to validate before
looking at it.

Orabug: 25795985
CVE: CVE-2017-7273

Reported-by: Benoit Camredon <benoit.camredon@airbus.com>
Tested-by: Benoit Camredon <benoit.camredon@airbus.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
(cherry picked from commit 1ebb71143758f45dc0fa76e2f48429e13b16d110)
Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ctf: prevent modules on the dedup blacklist from sharing any types at all

The deduplication blacklist exists to deal with a few modules, mostly
old ISA sound drivers, which consist of a #define and a #include of
another module, which then uses #ifdef profusely, sometimes on structure
members.  This leads to DWARF which has differently shaped structures
in the same source location, which conflicts with the purpose of the
deduplicator, which is to use the source location to identify identical
structures and merge them.

Before now, the blacklist worked by preventing the appearance of a
structure in a module on the blacklist from making a structure shared:
it could still be shared if other modules defined it.  This fails to
work for two reasons.

Firstly, if it is used by only *one* other module, we'll hit an
assertion failure in the emission phase which checks that any type it
emits is either defined in the module being emitted or in the shared
type repo.

We could remove that assertion, but we'd still be in the wrong, because
you could easily have a type in some header used by many modules that
said

struct trouble {
#ifdef MODULE_FOO
/* one definition */
#else
/* a different definition */
#endif
}

and a module that says

#define MODULE_FOO 1
#include <other_module.c>

Even if we blacklisted this module (and we would), this would still
fail, because 'struct trouble' would end up in the shared type
repository, and the existing code would fail to emit a new definition of
it in module blah, even though it should because its definition is
different.

This shows that if a module is pulling tricks like this we cannot trust
its use of shared types at all, since we have no visibility into
preprocessor definitions: regardless of the space cost (about 40KiB per
module), we cannot let it share any types whatsoever with the rest of
the kernel.  Rather than piling heaps of blacklist checks in all over
dwarf2ctf to implement this, we do it by exploiting the fact that the
deduplicator works entirely on the basis of type IDs.  We add a
'blacklist type prefix' that type_id() can add to the start of its IDs
(with some extra decoration, because the start of type IDs is a file
path, so we want this to be an impossible path).  If we set this prefix
to the module name if and only if the module is blacklisted, and do not
add one otherwise, then every blacklisted module will have a unique set
of IDs for all its types, which can be shared within the module but not
outside it, so every type in the module will be unique and none of them
will end up in the shared type repository.

While we're at it, add yet another ancient ISA sound driver that plays
the same games to the blacklist.

This fix makes blacklisting modules much more space-expensive: each such
module expands the current size of the kernel module package by about
40KiB.  (But there is only one blacklisted module built in current UEK,
so this is a tolerable cost.)

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: vincent.lim@oracle.com
Orabug: 26137220

ctf: emit bitfields in in-memory order

DWARF always emits bitfields in declaration order. e.g., for
this C:

struct IO_APIC_route_entry {
        __u32   vector          :  8,
                delivery_mode   :  3,   /* 000: FIXED
                                         * 001: lowest prio
                                         * 111: ExtINT
                                         */
                dest_mode       :  1,   /* 0: physical, 1: logical */
                delivery_status :  1,

we get this DWARF:

[  437f]    structure_type
             name                 (strp) "IO_APIC_route_entry"
             byte_size            (data1) 8
[  438b]      member
               name                 (strp) "vector"
               byte_size            (data1) 4
               bit_size             (data1) 8
               bit_offset           (data1) 24
               data_member_location (data1) 0
[  439a]      member
               name                 (strp) "delivery_mode"
               byte_size            (data1) 4
               bit_size             (data1) 3
               bit_offset           (data1) 21
               data_member_location (data1) 0
[  43a9]      member
               name                 (strp) "dest_mode"
               byte_size            (data1) 4
               bit_size             (data1) 1
               bit_offset           (data1) 20
               data_member_location (data1) 0
[  43b8]      member
               name                 (strp) "delivery_status"
               byte_size            (data1) 4
               bit_size             (data1) 1
               bit_offset           (data1) 19
               data_member_location (data1) 0

But CTF on little-endian requires the opposite: it has special handling
for the first member of a structure which assumes that it is closest to
the start of memory: in effect, it wants structure member addresses to
always ascend, even within bitfields, regardless of endianness (which
makes some sense intellectually as well).

dwarf2ctf's emission code generally emits sequentially, so except where
deduplication has eliminated items or dependent type insertion has added
them it emits things in the CTF in the same order as in the DWARF.  We
can avoid this for short runs, as in this case, by switching from
iteration to recursion in such cases, spotting a run at identical
data_member_location, recursing until we hit the end of the run, then
unwinding and emitting as we unwind until the recursion is over.

Orabug: 25815129
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: tomas.jedlicka@oracle.com

ctf: bitfield support

Support for bitfields in dwarf2ctf was embryonic and largely untested
before now due to bugs in libdtrace-ctf: with a fix for those in hand,
we can fix bitfields here too.

Bitfields in DWARF and CTF have annoyingly different representations.
In DWARF, a bitfield is represented something like this:

[ 16561]      member
               name                 (string) "ihl"
               decl_file            (data1) 225
               decl_line            (data1) 87
               type                 (ref4) [    38]
               byte_size            (data1) 1
               bit_size             (data1) 4
               bit_offset           (data1) 4
               data_member_location (sdata) 0
[...]
[    38]    typedef
             name                 (strp) "__u8"
             decl_file            (data1) 36
             decl_line            (data1) 20
             type                 (ref4) [    43]

i.e. the padding, size, and starting location are all represented in the
member, where you would conceptually expect it to be.

In CTF, the starting location of the conceptual containing type of a
bitfield is encoded in the member: but the size and starting location of
the bitfield itself is represented in the dependent type, which is added
as a "non-root" type (which cannot be looked up by name) so that it can
have the same name as the un-bitfielded base type without causing a name
clash.

We use the new DIE attribute override mechanism added in commit 8935199962
to override DW_AT_bit_size and DW_AT_bit_offset for such members (fixing
a pre-existing bug in the process: we were looking for the DW_AT_bit_size
on the structure as a whole!), and in the base-type emission function
checking for the existence of a DW_AT_bit_size/offset and responding to
them by overriding the size and offset derived from DW_AT_byte_size and
noting that this is a non-root type.  (The override needed, annoyingly,
is endian-dependent, because CTF consumers assume that on little-endian
systems the offset relates to the least-significant edge of the bitfield,
counting from the LSB, while DWARF assumes the opposite).

But this is not enough: unless more is done, this type will appear
to have the same type ID as its non-bitfield equivalent, leading to
confusion about which CTF file it should appear in and quite possibly
leading to it ending up in a CTF file that the structure containing the
bitfield cannot even see.  So augment type_id()'s representation of
base types from e.g. 'int' to something like 'int:4' if and only if
a DW_AT_bit_size or an override of it is present and that override is
a different size from the native bitness of the type itself (the
DW_AT_byte_size).  We encode the bit_offset only if there is also a
bit_size, as something like int:4:2.  (That's unambiguous because
these attributes always arrive in pairs in bitfields and never appear
in anything else in C-generated DWARF.)

Finally, this breaks an optimization in the deduplicator, which was that
all structure members reference some top-level type, so when marking a
type as seen, structure members could just be skipped.  Now, they have
to be chased iff they are bitfields using the same override trickery as
above to change the DW_AT_bit_size/offset in the member's type DIE, and
that bitfield override needs to be passed down to type_id() when finally
marking duplicated types as shared too.  (Avoid code duplication by
factoring out some related code from a horrible conditional in
detect_duplicates() into a new type_needs_sharing() function.)

Orabug: 25815129
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: tomas.jedlicka@oracle.com

ctf: emit file-scope static variables

For as long as dwarf2ctf has emitted variables into the CTF, it has
emitted only extern variables.  This is for two reasons: firstly, a
misconception that global variables with static linkage did not appear
in /proc/kallmodsyms (they do), and secondly, an ambiguity problem.

We cannot usefully distinguish two variables with the same name in the
same module: they differ only by address, so if both are static
variables in different translation units, we can't tell which is which.
(Also, we cannot emit more than one variable with a given name into a
given module CTF file in any case).  CTF's modular rather than
translation-unit-based variable/type scope bites us here.

I sought to avoid this bug by emitting only non-static variables, but
this does not save us, because there might be another static variable
with the same name in the same module, whereupon the ambiguity problem
arises all over again.  We must identify such ambiguous cases and strip
them out (not emitting CTF for this variable at all): then we can emit
static file-scope variables into the CTF without worry.

We do this by introducing a new, module-scope vars_seen hashtable into
the deduplicator state, which gains an entry for every variable name
seen in this module, and indicates whether it is static or not.  This
lets us tell if we have seen a variable with a given name more than
once, and if we have, whether any of the instances was static.  Then
we can consult this blacklist at variable-emission time, and skip any
variable if it is in the blacklist.

Unfortunately, computing the name for the variable's entry in the
blacklist is fairly expensive, and has to be done for every variable:
worse yet, this increases the number of variables emitted drastically
(in vmlinux and the shared typo repo alone, we go from 2247 to 10409
variables), and emitting that much CTF is not free: so the runtime goes
up by about 5%.  We will reclaim this lost speed soon (and then some).

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: tomas.jedlicka@oracle.com
Orabug: 25962387

ctf: speed up the dwarf2ctf duplicate detector some more

The duplicate detector's alias detection pass is still doing unnecessary
work.  When a non-opaque structure is marked shared, it is sometimes
necessary to do another deduplication scan, because the marking may have
marked types used within the structure as shared, which may require yet
more types (e.g. opaque uses of of that type in other modules) to be
marked shared as well.  So while we know this pass can only affect
structures/ unions/enums that have names, and their interior types, it
would seem that we must keep scanning them to see if they need
deduplication until none remain.

However, there is one exception: if a non-opaque type and its
corresponding opaque type are both already marked shared, or if we have
just processed them and marked them accordingly, we know that we will
never need to re-mark those particular types again, since they can't be
more shared than they already are: so we can remove them from
consideration in future passes.  Because we are only opening DWARF files
in this pass as needed now, this hugely cuts down the number of files we
process in subsequent passes: we still see the same number of passes,
but passes after the first (which marks tens of thousands of opaque
types as shared) only open a few files, mark a few hundred types, and
flash past in under a second.

In my tests, the alias fixup pass now takes under 10s, which can be
more or less ignored: all other passes other than initialization
and writeout are much more expensive.  (Before this series, it took
over a minute on the fastest machine I have access to, and over three
minutes on SPARC.)

Orabug: 25815306
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: tomas.jedlicka@oracle.com

ctf: strdup() -> xstrdup()

Several unadorned strdup()s have crept into dwarf2ctf: all of them
should be xstrdup(), since none are handling malloc failure themselves.

Orabug: 25815306
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: tomas.jedlicka@oracle.com

ctf: speed up the dwarf2ctf duplicate detector

dwarf2ctf is very slow.  By far its most expensive pass is the duplicate
detector.  Of necessity this has to scan every object file in the kernel
to identify shared types, which is pretty expensive even when the cache
is hot, but it's not doing this particularly efficiently and can be sped
up quite a lot.

As of commit f0d0ebd0b4, the duplicate detector's job was cut in two:
the first pass identifies all non-structure types and non-opaque
structures used by more than one module, and the second "alias fixup"
pass repeatedly unifies opaque and non-opaque structures with the same
name in different translation units, sharing their definitions.  Both of
these passes generate or modify type IDs, so both need access to the
DWARF DIE of the types in question, since the type ID is derived
recursively from the DIE: but the second pass need not look at the DWARF
of any translation units that do not contain structures that might be
unified.  However, the two are currently written in the same way, using
process_file() to traverse all the kernel's DWARF, even though the alias
fixup pass does almost nothing with that DWARF, and has less and less to
do on each iteration.

The sheer amount of wasted time here is remarkable.  We traverse the
DWARF once for primary duplicate detection, once for CTF emission, but
often four or five or even seven or eight times for the alias fixup pass
(the precise count depends upon the relationships between types and the
order in which the DWARF files are traversed).

So improve things by tracking all types that the alias fixup pass is
interested in (structure types that are not anonymous inner structures
nor opaque nor used only as the types of array indices) and stash them
away during the first duplicate detection pass in a new temporary
singly-linked list, detect_duplicates_state.named_structs.  We remember
the filename and DWARF DIE offset (so we can look the type up again) and
the type ID (because we just worked it out, so recomputing it would be a
waste of time).  Then, rather than doing a process_file() for the alias
fixup pass, traverse the linked list, opening DWARF files as needed to
mark things as shared (but no more often than that: marking non-opaque
types needs the DIE so we can traverse into all its subtypes and mark
them shared too, but marking opaque types needs no DIE at all).

This has a significant effect on dwarf2ctf speed, slashing 25% off its
runtime in my tests, reducing the duplicate detector's share of the
runtime from about 80s to about 24s.

The dominant time consumer is now CTF emission rather than the
duplicate detector.

Orabug: 25815306
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: tomas.jedlicka@oracle.com

ctf: add module parameter to simple_dwfl_new() and adjust both callers

An upcoming speedup to dwarf2ctf needs this.

Orabug: 25815306
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: tomas.jedlicka@oracle.com

ctf: fix the size of int and avoid duplicating it

An upcoming bitfield-capable release of libdtrace-ctf adds extra
consistency checking which identified embarrassing but heretofore
harmless bugs in dwarf2ctf's representation of basic types.

dwarf2ctf has to emit a few basic types "by hand", since these are used
in the representation of types one of CTF or DWARF does not bother to
encode more complex forms for (function pointers, always encoded as
'int (*)()' in CTF, and 'void').  Embarrassingly, we were getting the
size of 'int' wrong: it should be in bits but we were emitting a count
of bytes instead, leading to a CTF representation of a 4-bit int.
This is always overridden by an accurate representation built into
DTrace in real use, but libdtrace-ctf finds the inconsistency anyway.

Worse is that it emits a representation of 'int' twice, once by hand in
init_ctf_table() and then again when it comes across the real 'int' type
DIE in the debuginfo.  This is because we forgot to intern the types we
add by hand in the id_to_module and id_to_type hashes we use to detect
duplicate types, because we can only intern types we have a DWARF DIE
for, and these types either have no DIE at all or we just don't know
about it because we haven't even begun to traverse the debuginfo to look
up DIEs yet.

Fix this by adding a mark_shared_by_name() function which can be used to
intern basic types by name in these hashes.  It has hardwired knowledge
of the type ID notation, but no more such knowledge than is already
present in detect_duplicates_alias_fixup() and friends (no more than
that types with no associated filename or line number are preceded by
"////".)

Orabug: 25815129
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: tomas.jedlicka@oracle.com

ctf: allow overriding of DIE attributes: use it for parent bias

Currently, dwarf2ctf's CTF type emission machinery emits any given CTF
type based purely on the data in the corresponding DWARF DIE, with
possible adjustments based on the DIE's structural parent (useful for
structure members, if not so much for top-level types).  But from time
to time we need to adjust a CTF type depending on some property not of
its structural parent but of a type that depends on it: i.e., for
structures, we might want to adjust the offset of the members to cater
for the fact that this structure is being structurally merged with a
shorter assignment-compatible structure with identical name which
appeared in some translation unit that was processed earlier.

Currently, we handle this case, and only this case, by passing down a
'parent_bias' to all the CTF assembly functions.  Replace this with a
more generic mechanism whereby an array of 'overrides' can be passed
down to construct_ctf_id(), die_to_ctf(), and all subordinate assembly
functions: these overrides consist of an array of die_override_t's,
where each element can either override or add to the value of one DWARF
attribute: this kicks in for specific DWARF tags only. (Only numerical
attributes are supported, obviously.)

(We also pass the overrides down to type_id() so that overrides can
affect the ID of types and thus cause a single DWARF type to generate
multiple CTF types, though we do not use this facility in this commit.)

To process the attributes, we introduce a new private_find_override() to
search the override list, and private_dwarf_udata() to fetch a udata and
handle it. (We do not override dwarf_hasattr(): anything that wants to
override an attribute that may not exist has to call
private_find_override() itself.  If this happens a lot, we can introduce
an override for dwarf_hasattr() too.)

Currently we use this in exactly one place: in assemble_ctf_su_member(),
to replace the use of parent_bias.  Further uses will come in the next
commit: thanks to this commit, none of them will require adding new
parameters to all the CTF construction functions :)

Also rename the 'override' parameter on the CTF construction functions,
which was used by array assembly to indicate that CTF types should
replace their parent type, with a much less confusingly-named 'replace'
parameter.  (It was badly named before, but now that we have a parameter
named 'overrides' it is devastatingly badly named.)

Orabug: 25815129
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: tomas.jedlicka@oracle.com

DTrace tcp/udp provider probes

This patch adds DTrace SDT probes for the TCP and UDP protocols.  For tcp
the following probes are added:

tcp:::send Fires when a tcp segment is transmitted
tcp:::receive Fires when a tcp segment is received
tcp:::state-change Fires when a tcp connection changes state
tcp:::connect-request Fires when a SYN segment is sent
tcp:::connect-refused Fires when a RST is received for connection attempt
tcp:::connect-established
Fires when three-way handshake completes for
initiator

tcp:::accept-refused Fires when a RST is sent refusing connection attempt
tcp:::accept-established
Fires when three-way handshake succeeds for acceptor

Arguments for all of these probes are:

arg0 struct sk_buff *; to be translated into pktinfo_t * containing
implementation-independent packet data
arg1 struct sock *; to be translated into csinfo_t * containing
implementation-independent connection data
arg2 __dtrace_tcp_void_ip_t *; to be translated into ipinfo_t * containing
implementation-independent IP information.  Custom type is used as
this gives DTrace a hint that we can source IP information from other
arguments if the IP header is not available.
arg3 struct tcp_sock *; to be translated into tcpsinfo_t * containing
implementation-independent TCP connection data
arg4 struct tcphdr *; to be translated into a tcpinfo_t * containing
implementation-independent TCP header data
arg5 int representing previous state; to be translated into a
tcplsinfo_t * which contains the previous state.  Differs from
current state (arg6) for state-change probes only.
arg6 int representing current state.  Cannot be sourced from struct
tcp_sock as we sometimes need to probe before state change is
reflected there
arg7 int representing direction of traffic for probe; values are
DTRACE_NET_PROBE_INBOUND for receipt of data and
DTRACE_NET_PROBE_OUTBOUND for transmission.

For udp the following probes are added:

udp:::send Fires when a udp datagram is sent
udp:::receive Fires when a udp datagram is received

Arguments for these probes are:

arg0 struct sk_buff *; to be translated into pktinfo_t * containing
        implementation-independent packet data
arg1    struct sock *; to be translated into csinfo_t * containing
        implementation-independent connection data
arg2    void_ip_t *; to be translated into ipinfo_t * containing
        implementation-independent IP information.
arg3 struct udp_sock *; to be translated into a udpsinfo_t * containing
implementation-independent UDP connection data
arg4 struct udphdr *; to be translated into a udpinfo_t * containing
implementation-independent UDP header information.

Orabug: 25815197

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Reviewed-by: Rao Shoaib <rao.shoaib@oracle.com>
Acked-by: Håkon Bugge <haakon.bugge@oracle.com>

dtrace: define DTRACE_PROBE_ENABLED to 0 when !CONFIG_DTRACE

Right now there is no definition for this at all, breaking kernel
compilation.

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Reviewed-by: Nicolas Droux <nicolas.droux@oracle.com>
Orabug: 26145788

dtrace: ensure limit is enforced even when pcs is NULL

The dtrace_user_stacktrace() functions for x86_64 and sparc64 were
not handling the specified limit (st->limit correctly if the buffer
for PC values (st->pcs) was NULL. This commit ensures that we
decrement the limit whenever we encounter a PC, whether it gets
stored or not.

Orabug: 25949692
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Acked-by: Nick Alcock <nick.alcock@oracle.com>

dtrace: make x86_64 FBT return probe detection less restrictive

The FBT return probe detection mechanism on x86_64 was requiring that
the "ret" instruction be followed by a "push %rbp" or "nop", which is
much too restrictive. The new code allows probing of all "ret"
instructions that occur in a function regardless of what instructions
follows.

In order to make this possible, the functions to set and clear FBT
probes on x86_64 (dtrace_invop_add() and dtrace_invop_remove()) have
been modified to accept a 2nd argument that indicates the instruction
to patch the probe location with. This is needed because FBT return
probes need a different instruction on x86_64 (LOCK prefix to force
an invalid opcode trap isn't safe because we do not know what
instruction may follow the "ret").

This commit also fixes the declaration of the dtrace_bad_address()
function that was missing its return type.

Orabug: 25949048
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Acked-by: Nick Alcock <nick.alcock@oracle.com>

dtrace: support passing offset as arg0 to FBT return probes

FBT return probes pass the offset from the function start (in bytes)
as arg0. To make that possible, we pass the offset value in the call
to fbt_add_probe. For FBT entry probes we pass 0 (which is ignored).

Orabug: 25949086
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Acked-by: Nick Alcock <nick.alcock@oracle.com>

dtrace: make FBT entry probe detection less restrictive on x86_64

The logic on x86_64 to determine whether we can probe a function is
too restrictive. By placing the probe on the "push %rbp" instruction
we can cover more functions, in case the "mov %rsp,%rbp" instruction
does not follow it immediately.

Orabug: 25949030
Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Reviewed-by: Tomas Jedlicka <tomas.jedlicka@oracle.com>