Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine. The
argument end is of type pgoff_t. It was being converted to a vaddr offset
and passed to unmap_hugepage_range. However, end was also being used as
an argument to the vma_interval_tree_foreach controlling loop. In
addition, the conversion of end to vaddr offset was incorrect.
hugetlb_vmtruncate_list is called as part of a file truncate or fallocate
hole punch operation.
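The conversion described above can be sketched in user space. This is only an illustrative model, not the kernel code: struct vma_model, unmap_range_for_vma() and the PAGE_SHIFT value of 12 are assumptions made for the example. It keeps the page-offset range for the tree walk and derives a separate virtual address range per vma, assuming the caller only passes vmas that overlap the page range.

```c
#define PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/* Simplified stand-in for the kernel's vm_area_struct (hypothetical). */
struct vma_model {
	unsigned long vm_start;	/* first mapped virtual address */
	unsigned long vm_end;	/* one past the last mapped address */
	unsigned long vm_pgoff;	/* file offset of vm_start, in pages */
};

/*
 * Translate the file page range [start, end) into the virtual address
 * range to unmap inside one vma. end == 0 means "to the end of the file".
 * start and end stay in page units; only v_start and v_end are addresses.
 */
static void unmap_range_for_vma(const struct vma_model *vma,
				unsigned long start, unsigned long end,
				unsigned long *v_start, unsigned long *v_end)
{
	unsigned long v_offset = 0;

	/* Skip the part of the vma that lies before the page range. */
	if (vma->vm_pgoff < start)
		v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;

	if (end == 0) {
		*v_end = vma->vm_end;
	} else {
		/* Page offset relative to the vma, then rebased to its start. */
		*v_end = ((end - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
		if (*v_end > vma->vm_end)
			*v_end = vma->vm_end;
	}
	*v_start = vma->vm_start + v_offset;
}
```

The point of the sketch is that the end page offset is never shifted in place, so it remains valid as the bound of the vma walk.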
When truncating a hugetlbfs file, this bug could prevent some pages from
being unmapped. This is possible if there are multiple vmas mapping the
file, and there is a sufficiently sized hole between the mappings. The
size of the hole between two vmas (A,B) must be such that the starting
virtual address of B is greater than (ending virtual address of A <<
PAGE_SHIFT). In this case, the pages in B would not be unmapped. If
pages are not properly unmapped during truncate, the following BUG is hit:
kernel BUG at fs/hugetlbfs/inode.c:428!
In the fallocate hole punch case, this bug could prevent pages from being
unmapped as in the truncate case. However, for hole punch the result is
that unmapped pages will not be removed during the operation. For hole
punch, it is also possible that more pages than desired will be unmapped.
This unnecessary unmapping will cause page faults to reestablish the
mappings on subsequent page access.
Fixes: 1bfad99ab ("hugetlbfs: hugetlb_vmtruncate_list() needs to take a range") Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> [4.3] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 29bfa18ca579419db54d93250c199bcb54d80047) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Filipe Manana [Sun, 3 May 2015 00:56:00 +0000 (01:56 +0100)]
Btrfs: incremental send, fix clone operations for compressed extents
Marc reported a problem where the receiving end of an incremental send
was performing clone operations that failed with -EINVAL. This happened
because, unlike for uncompressed extents, we were not checking whether the
source clone offset and length, after adding the data offset, fall
within the source file's boundaries.
So make sure we do such checks when attempting to issue clone operations
for compressed extents.
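A bounds check of the kind described can be sketched as follows. This is a standalone illustration, not the btrfs send code: the helper name clone_range_in_bounds() and its parameters are hypothetical, and the comparisons are arranged so the unsigned additions cannot wrap.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when the byte range
 * [clone_offset + data_offset, clone_offset + data_offset + len)
 * lies inside a source file of src_size bytes.
 */
static bool clone_range_in_bounds(uint64_t clone_offset, uint64_t data_offset,
				  uint64_t len, uint64_t src_size)
{
	uint64_t start;

	/* Reject before adding, so the sum below cannot overflow. */
	if (clone_offset > src_size || data_offset > src_size - clone_offset)
		return false;

	start = clone_offset + data_offset;
	return len <= src_size - start;
}
```

A sender using such a check would fall back to a plain write when it returns false, instead of issuing a clone that the receiver rejects with -EINVAL.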
Problem reproducible with the following steps:
$ mkfs.btrfs -f /dev/sdb
$ mount -o compress /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount -o compress /dev/sdc /mnt2
# Create the file with a single extent of 128K. This creates a metadata file
# extent item with a data start offset of 0 and a logical length of 128K.
$ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo
# Now rewrite 112K of our file starting at offset 64K (the range 64K to
# 176K). This will make the inode's
# metadata continue to point to the 128K extent we created before, but now
# with an extent item that points to the extent with a data start offset of
# 112K and a logical length of 16K.
# That metadata file extent item is associated with the logical file offset
# at 176K and covers the logical file range 176K to 192K.
$ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo
# Now rewrite 12K starting at offset 180K (the range 180K to 192K). This
# will make the inode's metadata
# continue to point to the 128K extent we created earlier, with a single
# extent item that points to it with a start offset of 112K and a logical
# length of 4K.
# That metadata file extent item is associated with the logical file offset
# at 176K and covers the logical file range 176K to 180K.
$ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
At subvol /mnt/snap1
At subvol snap1
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
At subvol /mnt/snap2
At snapshot snap2
ERROR: failed to clone extents to bar
Invalid argument
Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.cz> Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 619d8c4ef7c5dd346add55da82c9179cd2e3387e) Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Maor Gottlieb [Mon, 27 Apr 2015 10:47:56 +0000 (13:47 +0300)]
net/mlx4_core: Fix FMR unmapping to allow remapping afterward
According to the device spec, we only need to set the ownership bit to SW
at FMR unmap; everything else we did was redundant.
This fix is ported from Mellanox OFED 3.1.
It covers some of the same issues fixed in UEK2 with the
following UEK2 commits:
bbdc2821db04 "mlx4_core: Avoid recycling old FMR R_Keys too soon"
5bddb281c0f1 "mlx4_ib: unmap FMR should update MPT status to 0xF"
The following comments from the bbdc2821db04 patch by Olaf Kirch <okir@lst.de>
add more useful information about the impact of this change on RDS usage
in Oracle applications, and are copied here to retain that context.
When a FMR is unmapped, mlx4 resets the map count to 0, and clears the
upper part of the R_Key which is used as the sequence counter.
This poses a problem for RDS, which uses ib_fmr_unmap as a fence
operation. RDS assumes that after issuing an unmap, the old R_Keys
will be invalid for a "reasonable" period of time. For instance,
Oracle processes use shared memory buffers allocated from a pool of
buffers. When a process dies, we want to reclaim these buffers -- but
we must make sure there are no pending RDMA operations to/from those
buffers. The only way to achieve that is by unmapping and syncing the
TPT.
However, when the sequence count is reset on unmap, there is a high
likelihood that a new mapping will be given the same R_Key that was
issued a few milliseconds ago.
To prevent this, don't reset the sequence count when unmapping a FMR.
Kris Van Hees [Thu, 28 Jan 2016 02:36:01 +0000 (21:36 -0500)]
dtrace: support multiple instances of the same probe in a function
Up until now, no function in the kernel (or modules) ever had more
than a single instance of the same probe in a single function. Due
to compiler optimizations, sparc64 caused duplication of a probe
call in kernel code. The existing code was not adequate to deal
with that situation, causing the probe to not get enabled in more
than one location, thereby causing it to not fire when expected.
This commit adds support to dtrace_sdt.sh to handle multiple
instances of the same probe in a single function. It also adds
equivalent support in the SDT boot-time processing to ensure that
probes with multiple locations in a function are linked together
so that all locations are enabled and disabled together.
(While this was found as an artifact of testing the 4.3 kernel
series, it also applies to 4.1 because it is not unlikely that
a compiler upgrade may cause the same problem there.)
Orabug: 22514493 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Wed, 20 Jan 2016 07:48:08 +0000 (02:48 -0500)]
dtrace: ensure signal-handled is fired with correct signal
When sending a signal to a process that neither ignores nor handles
that signal, it is translated into a SIGKILL because the signal will
effectively cause the process to be terminated. As a result, one
would see a signal-send probe for the original signal, and a
signal-handled probe for SIGKILL.
Since in these cases the original signal is retained within the
target task for exit reporting, it is possible to report the original
signal number in the signal-handled probe. This commit accomplishes
that to ensure that accurate pairing for signal probes is possible.
Orabug: 22573604 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Kris Van Hees [Mon, 18 Jan 2016 09:19:53 +0000 (04:19 -0500)]
dtrace: ensure that PID 0 has a psinfo struct
Test builtinvar/tst.psinfo.d lists a large number of memory access
errors trying to read from nonexistent psinfo information for PID 0.
This commit ensures that prior to dtrace doing any work, PID 0 (the
init_task) has a psinfo structure associated with it.
Orabug: 22561297 Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
In IPoIB code, parallel access to tx_outstanding (send path vs. event
processing path) needs to be serialized. We use priv->lock to protect it. We
also use that lock to make stopping/opening the tx queue atomic with
changes to tx_outstanding, which closes the race between the completion
handler opening the tx queue and the send path closing it.
This patch also makes sure tx_outstanding is increased before
post_send is called, to avoid a decrease happening before the increase in
case the increase runs later than the event handler.
Impact on throughput is a ~1.5% drop.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.
Somehow this hunk got lost.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: John Sobecki <john.sobecki@oracle.com> Tested-by: Dai Ngo <dai.ngo@oracle.com>
A value of 2560 (1280k) will accommodate a 10-data-disk stripe
write with chunk size 128k. In the testing I've done using
iozone, fio, and aio-stress across a number of different storage
devices, a value of 1280 does not show a big performance
difference from 512, but will hopefully help software RAID
setups using SATA disks, as reported by Christoph.
NOTE: drivers/block/aoe/aoeblk.c sets its own max_hw_sectors_kb to
BLK_DEF_MAX_SECTORS. So, this patch essentially changes aoeblk to
use a larger maximum sector size, and I did not test this.
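The arithmetic behind the 2560 figure can be checked with a short sketch. The helper names below are made up for the example; the only facts used are the 512-byte sector size and the 10-data-disk, 128k-chunk layout mentioned above.

```c
#define SECTOR_BYTES 512u	/* block layer sector size */

/* Convert a count of 512-byte sectors to KiB. */
static unsigned int sectors_to_kib(unsigned int sectors)
{
	return sectors * SECTOR_BYTES / 1024u;
}

/* Size in KiB of one full stripe across the data disks of an array. */
static unsigned int full_stripe_kib(unsigned int data_disks,
				    unsigned int chunk_kib)
{
	return data_disks * chunk_kib;
}
```

With these, 2560 sectors come out as 1280 KiB, which is exactly one full stripe for 10 data disks with a 128k chunk, while the previous 512 KiB default corresponds to 1024 sectors.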
EXT4-fs warning (device dm-X): ext4_end_bio:313: I/O error writing to ino
Below is revert text from mainline:
This reverts commit 34b48db66e08ca1c1bc07cf305d672ac940268dc.
That commit caused performance regressions for streaming I/O
workloads on a number of different storage devices, from
SATA disks to external RAID arrays. It also managed to
trip up some buggy firmware in at least one drive, causing
data corruption.
The next patch will bump the default max_sectors_kb value to
1280, which will accommodate a 10-data-disk stripe write
with chunk size 128k. In the testing I've done using iozone,
fio, and aio-stress, a value of 1280 does not show a big
performance difference from 512. This will hopefully still
help the software RAID setup that Christoph saw the original
performance gains with while still not regressing other
storage configurations.
with a usage count of 100 * the number of times the program has been run,
then the kernel is malfunctioning. If leaked-keyring has zero usages or
has been garbage collected, then the problem is fixed.
Reported-by: Yevgeny Pats <yevgeny@perception-point.io> Signed-off-by: David Howells <dhowells@redhat.com>
Orabug: 22563965
CVE: CVE-2016-0728 Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
Nick Alcock [Wed, 16 Dec 2015 21:54:12 +0000 (21:54 +0000)]
sparc: increase NR_syscalls properly
When waitfd() was added, NR_syscalls was never incremented, so
the syscall always unconditionally returned -ENOSYS on SPARC.
Wire it up by incrementing NR_syscalls correctly.
(The syscall has been tested, but, it turns out, only on x86 and
on SPARC in the 4.0 kernel. Testing on SPARC with its single
user, DTrace, suggests that all is still well here.)
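Why a stale limit makes a wired-up call return -ENOSYS can be shown with a small user-space model. Everything here is hypothetical (the three-entry table, the function names, the dispatch() helper); it only mirrors the general shape of a limit check in front of a syscall table.

```c
#include <errno.h>

typedef long (*syscall_fn)(void);

static long sys_first(void)  { return 10; }
static long sys_second(void) { return 20; }
static long sys_added(void)  { return 30; }	/* the newly added call */

/* Model syscall table: the new entry is present in the table itself. */
static const syscall_fn call_table[] = { sys_first, sys_second, sys_added };

/*
 * Dispatch with a limit check. nr_syscalls plays the role of NR_syscalls:
 * if it was not bumped, the new entry is unreachable.
 */
static long dispatch(unsigned int nr, unsigned int nr_syscalls)
{
	if (nr >= nr_syscalls ||
	    nr >= sizeof(call_table) / sizeof(call_table[0]))
		return -ENOSYS;
	return call_table[nr]();
}
```

Calling dispatch(2, 2) models the broken state (table entry present, limit not incremented), and dispatch(2, 3) models the fixed one.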
Orabug: 22390316 Reviewed-by: Todd Vierling <todd.vierling@oracle.com> Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Santosh Shilimkar [Fri, 8 Jan 2016 17:24:58 +0000 (09:24 -0800)]
Merge branch 'topic/uek-4.1/xen' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/xen' of git://ca-git.us.oracle.com/linux-uek:
xen/events/fifo: Consume unprocessed events when a CPU dies
Revert "xen/fb: allow xenfb initialization for hvm guests"
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
xen/pciback: Do not install an IRQ handler for MSI interrupts.
xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
xen/pciback: Save xen_pci_op commands before processing it
xen-scsiback: safely copy requests
xen-blkback: read from indirect descriptors only once
xen-blkback: only read request operation from shared ring once
xen-netback: use RING_COPY_REQUEST() throughout
xen-netback: don't use last request to determine minimum Tx credit
xen: Add RING_COPY_REQUEST()
John Haxby [Wed, 9 Dec 2015 05:30:22 +0000 (16:30 +1100)]
ocfs2: return non-zero st_blocks for inline data
Some versions of tar assume that files with st_blocks == 0 do not contain
any data and will skip reading them entirely. See also commit 9206c561554c ("ext4: return non-zero st_blocks for inline data").
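One plausible way to report a non-zero block count for inline data is to round the file size up to 512-byte units. The sketch below is an assumption about the general approach, not a copy of the ocfs2 change; reported_st_blocks() and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: blocks to report in st_blocks (512-byte units).
 * allocated_blocks is what the allocator accounts for, which is 0 when
 * the data lives inline in the inode.
 */
static uint64_t reported_st_blocks(uint64_t allocated_blocks, uint64_t size,
				   bool inline_data)
{
	if (inline_data)
		allocated_blocks += (size + 511) >> 9;	/* round up */
	return allocated_blocks;
}
```

An empty inline file still reports 0 blocks under this scheme, which matches what tar expects for a file with no data.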
Signed-off-by: John Haxby <john.haxby@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Acked-by: Gang He <ghe@suse.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ca426103429543e7a9be9017537fc3ffc37b5724)
Orabug: 22218243 Signed-off-by: John Haxby <john.haxby@oracle.com>
Ross Lagerwall [Fri, 19 Jun 2015 15:15:57 +0000 (16:15 +0100)]
xen/events/fifo: Consume unprocessed events when a CPU dies
When a CPU is offlined, there may be unprocessed events on a port for
that CPU. If the port is subsequently reused on a different CPU, it
could be in an unexpected state with the link bit set, resulting in
interrupts being missed. Fix this by consuming any unprocessed events
for a particular CPU when that CPU dies.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Cc: <stable@vger.kernel.org> # 3.14+ Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 3de88d622fd68bd4dbee0f80168218b23f798fd0)
Orabug: 22498877 Tested-by: Carson Hovey <carson.hovey@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Santosh Shilimkar [Fri, 8 Jan 2016 16:51:39 +0000 (08:51 -0800)]
Merge branch 'uek4/xsa157' of git://ca-git.us.oracle.com/linux-konrad-public into topic/uek-4.1/xen
* 'uek4/xsa157' of git://ca-git.us.oracle.com/linux-konrad-public:
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
xen/pciback: Do not install an IRQ handler for MSI interrupts.
xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
Santosh Shilimkar [Fri, 8 Jan 2016 16:51:04 +0000 (08:51 -0800)]
Merge branch 'uek4/xsa155' of git://ca-git.us.oracle.com/linux-konrad-public into topic/uek-4.1/xen
* 'uek4/xsa155' of git://ca-git.us.oracle.com/linux-konrad-public:
xen/pciback: Save xen_pci_op commands before processing it
xen-scsiback: safely copy requests
xen-blkback: read from indirect descriptors only once
xen-blkback: only read request operation from shared ring once
xen-netback: use RING_COPY_REQUEST() throughout
xen-netback: don't use last request to determine minimum Tx credit
xen: Add RING_COPY_REQUEST()
which is:
Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Date: Fri Jan 3 19:02:09 2014 +0000
xen/fb: allow xenfb initialization for hvm guests
There is no reason why an HVM guest shouldn't be allowed to use xenfb.
as "Xend" exposes a PV and VGA (via QEMU) driver to the guests (either
PV or HVM).
The backend in this case is QEMU - and it only provides the emulation
for the VGA backend (for HVM guests). Upstream QEMU can do both - but
since we are using Xend - we only expose QEMU-traditional.
Upstreamwise we hadn't reached a good decision. Poking for XenStore
keys to see if the backend is Xend vs xl seems odd. And the issue
goes away in 'xl' as you can't define both PV and VGA drivers. Only
one of them is allowed.
The 'proper' fix would be to teach Xend to not expose PV FB for HVM
guests but that code is pretty generic.
So continuing on with this revert.
OraBug: 20386370 - XENBUS_PROBE_FRONTEND: TIMEOUT CONNECTING TO DEVICE EROR WITH UEK4 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Mon, 2 Nov 2015 23:13:27 +0000 (18:13 -0500)]
xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
commit f598282f51 ("PCI: Fix the NIU MSI-X problem in a better way")
teaches us that dealing with MSI-X can be troublesome.
Further checks in the MSI-X architecture show that if the
PCI_COMMAND_MEMORY bit is turned off in the PCI_COMMAND register we
may not be able to access the BARs (since they are memory regions).
Since the MSI-X tables are located there, that can lead
to us causing PCIe errors. Inhibit us performing any
operation on the MSI-X unless the MEMORY bit is set.
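The gate described above amounts to testing one bit of the command register before touching MSI-X. A minimal sketch, assuming a hypothetical helper msix_op_allowed() and -ENXIO as the error value (the error code is an assumption for illustration); PCI_COMMAND_MEMORY is the standard memory-space enable bit, 0x2.

```c
#include <errno.h>
#include <stdint.h>

#define PCI_COMMAND_MEMORY 0x2	/* memory-space enable bit of PCI_COMMAND */

/*
 * Hypothetical gate: refuse MSI-X operations while memory decoding is
 * off, because the MSI-X table lives in a memory BAR.
 */
static int msix_op_allowed(uint16_t pci_command)
{
	if (!(pci_command & PCI_COMMAND_MEMORY))
		return -ENXIO;	/* assumed error code */
	return 0;
}
```

In a real backend the pci_command value would come from reading the device's PCI_COMMAND register at the time of the request, not from a cached copy.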
Note that Xen hypervisor with:
"x86/MSI-X: access MSI-X table only after having enabled MSI-X"
will return:
xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3!
when the generic MSI code tries to set up the PIRQ without the
MEMORY bit set. This means that with later versions of Xen
(4.6) this patch is not necessary.
This is part of XSA-157
CC: stable@vger.kernel.org Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 408fb0e5aa7fda0059db282ff58c3b2a4278baa0) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
Otherwise just continue on, returning the same values as
previously (return of 0, and op->result has the PIRQ value).
This does not change the behavior of XEN_PCI_OP_disable_msi[|x].
The pci_disable_msi or pci_disable_msix have the checks for
msi_enabled or msix_enabled so they will error out immediately.
However the guest can still call these operations and cause
us to disable the 'ack_intr'. That means the backend IRQ handler
for the legacy interrupt will not respond to interrupts anymore.
If the device is causing an interrupt storm, this will lead to
the Linux generic code disabling the interrupt line.
Naturally this will only happen if the device in question
is plugged in on the motherboard on shared level interrupt GSI.
This is part of XSA-157
CC: stable@vger.kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 7cfb905b9638982862f0331b36ccaaca5d383b49) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Mon, 2 Nov 2015 22:24:08 +0000 (17:24 -0500)]
xen/pciback: Do not install an IRQ handler for MSI interrupts.
Otherwise a guest can subvert the generic MSI code to trigger
a BUG_ON condition during MSI interrupt freeing:
 for (i = 0; i < entry->nvec_used; i++)
         BUG_ON(irq_has_action(entry->irq + i));
The Xen PCI backend installs an IRQ handler (request_irq) for
the dev->irq whenever the guest writes PCI_COMMAND_MEMORY
(or PCI_COMMAND_IO) to the PCI_COMMAND register. This is
done in case the device has legacy interrupts and the GSI line
is shared by the backend devices.
To subvert the backend the guest needs to make the backend
change the dev->irq from the GSI to the MSI interrupt line,
make the backend allocate an interrupt handler, and then command
the backend to free the MSI interrupt and hit the BUG_ON.
Since the backend only calls 'request_irq' when the guest
writes to the PCI_COMMAND register the guest needs to call
XEN_PCI_OP_enable_msi before any other operation. This will
cause the generic MSI code to setup an MSI entry and
populate dev->irq with the new PIRQ value.
Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY
and cause the backend to setup an IRQ handler for dev->irq
(which instead of the GSI value has the MSI pirq). See
'xen_pcibk_control_isr'.
Then the guest disables the MSI: XEN_PCI_OP_disable_msi
which ends up triggering the BUG_ON condition in 'free_msi_irqs'
as there is an IRQ handler for the entry->irq (dev->irq).
Note that this cannot be done using MSI-X as the generic
code does not over-write dev->irq with the MSI-X PIRQ values.
The patch inhibits setting up the IRQ handler if MSI or
MSI-X (for symmetry reasons) code had been called successfully.
P.S.
When Xen PCIBack sets up the device for guest consumption, it
ends up writing 0 to the PCI_COMMAND (see xen_pcibk_reset_device).
XSA-120 addendum patch removed that - however when upstreaming said
addendum we found that it caused issues with qemu upstream. That
has now been fixed in qemu upstream.
This is part of XSA-157
CC: stable@vger.kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit a396f3a210c3a61e94d6b87ec05a75d0be2a60d0) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Mon, 2 Nov 2015 23:07:44 +0000 (18:07 -0500)]
xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
The guest sequence of:
a) XEN_PCI_OP_enable_msix
b) XEN_PCI_OP_enable_msix
results in hitting a NULL pointer due to using freed pointers.
The device passed in the guest MUST have MSI-X capability.
The a) constructs a SysFS representation of MSI and MSI groups.
The b) adds a second set of them but adding in to SysFS fails (duplicate entry).
'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that
in a) pdev->msi_irq_groups is still set) and also frees ALL of the
MSI-X entries of the device (the ones allocated in step a) and b)).
The unwind code: 'free_msi_irqs' deletes all the entries and tries to
delete the pdev->msi_irq_groups (which hasn't been set to NULL).
However the pointers in the SysFS are already freed and we hit a
NULL pointer further on when 'strlen' is attempted on a freed pointer.
The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard
against that. The check for msi_enabled is not strictly necessary.
This is part of XSA-157
CC: stable@vger.kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 5e0ce1455c09dd61d029b8ad45d82e1ac0b6c4c9) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
The guest sequence of:
a) XEN_PCI_OP_enable_msi
b) XEN_PCI_OP_enable_msi
c) XEN_PCI_OP_disable_msi
results in hitting a BUG_ON condition in the msi.c code.
The MSI code uses a dev->msi_list to which it adds MSI entries.
Under the above conditions a BUG_ON() can be hit. The device
passed in the guest MUST have MSI capability.
The a) adds the entry to the dev->msi_list and sets msi_enabled.
The b) adds a second entry but adding in to SysFS fails (duplicate entry)
and deletes all of the entries from msi_list and returns (with msi_enabled
still set). c) pci_disable_msi passes the msi_enabled checks and hits:
BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));
and blows up.
The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard
against that. The check for msix_enabled is not strictly necessary.
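The guard can be modelled in a few lines of user-space C. This is a sketch under assumptions: struct dev_model, enable_msi_guarded() and the specific error codes are chosen for illustration and are not taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the MSI state bits of a PCI device. */
struct dev_model {
	bool msi_enabled;
	bool msix_enabled;
};

/*
 * Refuse a second enable instead of handing it to the generic MSI code,
 * so a repeated request from the guest never reaches the failing path.
 */
static int enable_msi_guarded(struct dev_model *dev)
{
	if (dev->msi_enabled)
		return -EALREADY;	/* assumed error code */
	if (dev->msix_enabled)
		return -ENXIO;		/* assumed error code */

	dev->msi_enabled = true;	/* stands in for the real enable call */
	return 0;
}
```

With this in place, the a), b), c) sequence above stops at step b): the second enable fails early and msi_list is left with its single valid entry.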
This is part of XSA-157.
CC: stable@vger.kernel.org Reviewed-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 56441f3c8e5bd45aab10dd9f8c505dd4bec03b0d) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Mon, 16 Nov 2015 17:40:48 +0000 (12:40 -0500)]
xen/pciback: Save xen_pci_op commands before processing it
Double fetch vulnerabilities happen when a variable is
fetched twice from shared memory but a security check is only
performed the first time.
The xen_pcibk_do_op function performs a switch statement on the op->cmd
value which is stored in shared memory. Interestingly this can result
in a double fetch vulnerability depending on the performed compiler
optimization.
This patch fixes it by saving the xen_pci_op command before
processing it. We also use 'barrier' to make sure that the
compiler does not perform any optimization.
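The single-fetch pattern can be sketched as below. It is a simplified model, not the pciback code: struct op_model, the OP_* values and handle_op() are hypothetical, and a volatile-qualified read is used here in place of the kernel's barrier() to force exactly one load from the shared location.

```c
/* Hypothetical layout of a request living in memory shared with a guest. */
struct op_model {
	unsigned int cmd;
};

enum { OP_READ = 0, OP_WRITE = 1, OP_MAX = 2 };

static int handle_op(const volatile struct op_model *shared)
{
	/* Single fetch: validation and dispatch both use this private copy. */
	unsigned int cmd = shared->cmd;

	if (cmd >= OP_MAX)
		return -1;

	switch (cmd) {
	case OP_READ:
		return 100;
	case OP_WRITE:
		return 200;
	}
	return -1;
}
```

Because the check and the switch both operate on the local cmd, a guest that rewrites shared->cmd after the fetch cannot steer the switch to a value that was never validated.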
This is part of XSA155.
CC: stable@vger.kernel.org Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 8135cf8b092723dbfcc611fe6fdcb3a36c9951c5) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen-blkback: read from indirect descriptors only once
Since indirect descriptors are in memory shared with the frontend, the
frontend could alter the first_sect and last_sect values after they have
been validated but before they are recorded in the request. This may
result in I/O requests that overflow the foreign page, possibly
overwriting local pages when the I/O request is executed.
When parsing indirect descriptors, only read first_sect and last_sect
once.
David Vrabel [Fri, 30 Oct 2015 15:17:06 +0000 (15:17 +0000)]
xen-netback: use RING_COPY_REQUEST() throughout
Instead of open-coding memcpy()s and directly accessing Tx and Rx
requests, use the new RING_COPY_REQUEST() that ensures the local copy
is correct.
This is more than is strictly necessary for guest Rx requests since
only the id and gref fields are used and it is harmless if the
frontend modifies these.
This is part of XSA155.
CC: stable@vger.kernel.org Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 68a33bfd8403e4e22847165d149823a2e0e67c9c) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Vrabel [Fri, 30 Oct 2015 15:16:01 +0000 (15:16 +0000)]
xen-netback: don't use last request to determine minimum Tx credit
The last request transmitted by the guest gives no indication about the
minimum amount of credit that the guest might need to send a packet
since the last packet might have been a small one.
Instead allow for the worst case 128 KiB packet.
This is part of XSA155.
CC: stable@vger.kernel.org Reviewed-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 0f589967a73f1f30ab4ac4dd9ce0bb399b4d6357) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Vrabel [Fri, 30 Oct 2015 14:58:08 +0000 (14:58 +0000)]
xen: Add RING_COPY_REQUEST()
RING_GET_REQUEST() on a shared ring is easy to use incorrectly
(i.e., by not considering that the other end may alter the data in the
shared ring while it is being inspected). Safe usage of a request
generally requires taking a local copy.
Provide a RING_COPY_REQUEST() macro to use instead of
RING_GET_REQUEST() and an open-coded memcpy(). This takes care of
ensuring that the copy is done correctly regardless of any possible
compiler optimizations.
Use a volatile source to prevent the compiler from reordering or
omitting the copy.
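The idea of copying through a volatile source can be illustrated with a user-space model. This is not the RING_COPY_REQUEST() macro itself: the ring layout, the names and the field-by-field copy are assumptions made for a self-contained example.

```c
#define RING_SIZE_MODEL 4u	/* assumption: power-of-two ring size */

/* Hypothetical request and ring layout. */
struct req_model {
	unsigned int id;
	unsigned int len;
};

struct ring_model {
	struct req_model slots[RING_SIZE_MODEL];
};

/*
 * Copy one request out of the shared ring into private memory. Reading
 * through a volatile-qualified pointer means each field is loaded once
 * from the ring and the compiler cannot drop or repeat the loads.
 */
static void ring_copy_request(const struct ring_model *ring, unsigned int idx,
			      struct req_model *out)
{
	const volatile struct req_model *src =
		&ring->slots[idx & (RING_SIZE_MODEL - 1)];

	out->id = src->id;
	out->len = src->len;
}
```

After the call, the backend validates and uses only the local copy; later changes to the ring slot by the other end no longer affect it.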
This is part of XSA155.
CC: stable@vger.kernel.org Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 454d5d882c7e412b840e3c99010fe81a9862f6fb) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Santosh Shilimkar [Fri, 18 Dec 2015 02:47:33 +0000 (18:47 -0800)]
Merge branch 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek:
KEYS: Fix crash when attempt to garbage collect an uninstantiated keyring
KEYS: Fix race between key destruction and finding a keyring by name
David Howells [Thu, 15 Oct 2015 16:21:37 +0000 (17:21 +0100)]
KEYS: Fix crash when attempt to garbage collect an uninstantiated keyring
The following sequence of commands:
i=`keyctl add user a a @s`
keyctl request2 keyring foo bar @t
keyctl unlink $i @s
tries to invoke an upcall to instantiate a keyring if one doesn't already
exist by that name within the user's keyring set. However, if the upcall
fails, the code sets keyring->type_data.reject_error to -ENOKEY or some
other error code. When the key is garbage collected, the key destroy
function is called unconditionally and keyring_destroy() uses list_empty()
on keyring->type_data.link - which is in a union with reject_error.
Subsequently, the kernel tries to unlink the keyring from the keyring names
list - which oopses like this:
David Howells [Fri, 25 Sep 2015 15:30:08 +0000 (16:30 +0100)]
KEYS: Fix race between key destruction and finding a keyring by name
There appears to be a race between:
(1) key_gc_unused_keys() which frees key->security and then calls
keyring_destroy() to unlink the name from the name list
(2) find_keyring_by_name() which calls key_permission(), thus accessing
key->security, on a key before checking to see whether the key usage is 0
(ie. the key is dead and might be cleaned up).
Fix this by calling ->destroy() before cleaning up the core key data -
including key->security.
Santosh Shilimkar [Sun, 13 Dec 2015 07:30:24 +0000 (23:30 -0800)]
Merge branch 'topic/uek-4.1/secureboot' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/secureboot' of git://ca-git.us.oracle.com/linux-uek:
conditionalize Secure Boot initialization on x86 platform
x86/efi: Set securelevel when loaded without efi stub
Santosh Shilimkar [Sun, 13 Dec 2015 07:29:57 +0000 (23:29 -0800)]
Merge branch 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek:
KVM: svm: unconditionally intercept #DB
KVM: x86: work around infinite loop in microcode when #AC is delivered
A previous commit, bbb87ba5690c72821614acdd465c24eeb99fa923,
introduced a call to efi_secure_boot_init() into the start_kernel()
sequence. The new call was not conditionalized based on platform.
Secure Boot is only implemented on the x86 platform; the definition
of efi_secure_boot_init() is in arch/x86/platform/efi/efi.c. On
non-x86 architectures, this caused an implicit-declaration error and
broke the build.
This commit wraps the call in #ifdef CONFIG_X86/#endif to eliminate
the error.
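The shape of the fix can be shown with a small model. start_kernel_model() and the call counter are hypothetical stand-ins; only the #ifdef CONFIG_X86 wrapping around the efi_secure_boot_init() call reflects what is described above.

```c
/* Counts how often the init routine ran; for illustration only. */
static int secure_boot_init_calls;

#ifdef CONFIG_X86
/* Stand-in for the real routine, which only exists on x86. */
static void efi_secure_boot_init(void)
{
	secure_boot_init_calls++;
}
#endif

/* Model of the init sequence: the call is compiled in only on x86. */
static void start_kernel_model(void)
{
#ifdef CONFIG_X86
	efi_secure_boot_init();
#endif
}
```

Built without -DCONFIG_X86, the call site disappears entirely, so there is no reference to an undeclared function; built with it, the routine runs once.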
With UEFI Secure Boot enabled and securelevel set, after a kernel is loaded
using kexec and booted, securelevel is disabled. With the securelevel patch
set, the state of UEFI Secure Boot is queried when booted via the efi stub, but
kexec does not use the efi stub.
To allow kernels which are not loaded through the efi stub to properly set
securelevel as well, add a new init routine to start_kernel() to query the
state of UEFI Secure Boot and enable securelevel if needed.
Taken from https://bugzilla.redhat.com/attachment.cgi?id=1052836 .
Signed-off-by: Linn Crosetto <linn@hp.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
Eric Northup [Tue, 3 Nov 2015 17:03:53 +0000 (18:03 +0100)]
KVM: x86: work around infinite loop in microcode when #AC is delivered
It was found that a guest can DoS a host by triggering an infinite
stream of "alignment check" (#AC) exceptions. This causes the
microcode to enter an infinite loop where the core never receives
another interrupt. The host kernel panics pretty quickly due to the
effects (CVE-2015-5307).
Signed-off-by: Eric Northup <digitaleric@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Orabug: 22333632
CVE: CVE-2015-5307
mainline commit 54a20552e1eae07aa240fa370a0293e006b5faed Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Santosh Shilimkar [Fri, 11 Dec 2015 20:21:39 +0000 (12:21 -0800)]
Merge branch 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek:
uek-rpm: module signing key verification on sparc
uek-rpm: rebuild module kabi list
Santosh Shilimkar [Fri, 11 Dec 2015 20:21:23 +0000 (12:21 -0800)]
Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek:
scsi: Fix a bdi reregistration race
i40e: Fix for recursive RTNL lock during PROMISC change
Santosh Shilimkar [Fri, 11 Dec 2015 20:21:00 +0000 (12:21 -0800)]
Merge branch 'topic/uek-4.1/stable-cherry-picks' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/stable-cherry-picks' of git://ca-git.us.oracle.com/linux-uek:
Revert "netlink: Fix autobind race condition that leads to zero port ID"
Revert "netlink: Replace rhash_portid with bound"
See also patch "block: destroy bdi before blockdev is unregistered"
(commit ID 6cd18e711dd8).
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: James Bottomley <JBottomley@Odin.com>
(cherry picked from commit bf2cf3baa20b0a6cd2d08707ef05dc0e992a8aa0)
Orabug: 22250360 Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>
Conflicts:
drivers/scsi/scsi_sysfs.c
i40e: Fix for recursive RTNL lock during PROMISC change
The sync_vsi_filters function can be called directly under RTNL
or through the timer subtask without one. This was causing a deadlock.
If sync_vsi_filters is called from a thread which held the lock,
and in another thread the PROMISC setting got changed we would
be executing the PROMISC change in the thread which already held
the lock alongside the other filter update. The PROMISC change
requires a reset if we are on a VEB, which requires it to be called
under RTNL.
Earlier the driver would call reset for PROMISC change without
checking if we were already under RTNL and would try to grab it
causing a deadlock. This patch changes the flow to see if we are
already under RTNL before trying to grab it.
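The flow change can be sketched in user space as follows. This is an
illustration under assumptions, not the driver code: struct sketch_lock
models a non-recursive lock such as RTNL, and the caller_holds_lock
parameter is a hypothetical way of telling the routine whether the lock is
already held.

```c
#include <stdbool.h>

/* Hypothetical model of a non-recursive lock such as RTNL. */
struct sketch_lock {
    bool held;
    int  deadlocks;   /* counts attempts to take the lock while already held */
};

static void sketch_lock_take(struct sketch_lock *l)
{
    if (l->held)
        l->deadlocks++;   /* a real non-recursive lock would hang here */
    l->held = true;
}

static void sketch_lock_release(struct sketch_lock *l)
{
    l->held = false;
}

/* Apply a PROMISC change that must run under the lock. caller_holds_lock
 * is an assumed parameter: true when the caller already owns the lock. */
static void promisc_change_sketch(struct sketch_lock *rtnl,
                                  bool caller_holds_lock)
{
    if (!caller_holds_lock)
        sketch_lock_take(rtnl);

    /* ... the reset that requires the lock would go here ... */

    if (!caller_holds_lock)
        sketch_lock_release(rtnl);
}
```

The lock is only taken, and only released, on the path that did not
already hold it, so the caller's ownership is left untouched.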
Add udata argument to shared pd interface alloc_shpd()
consistent with evolution of other similar ib_core
interfaces so providers that wish to support it can use it.
For providers (like current Mellanox driver code) that
do not expect user data, we assert a warning.
That commit, along with d48623677191e0f035d7afd344f92cf880b01f8e, has
been shown to produce a hang in the Oracle Real Application Clusters
(RAC) silent-installation procedure.
Ethan Zhao [Wed, 9 Dec 2015 10:50:41 +0000 (02:50 -0800)]
ixgbe: make a workaround to tx hang issue under dom
Report 1 tx queue to the net core to work around the tx hang
issue reported in the Xen environment.
The change is limited to dom0; bare metal is left unchanged.
Orabug: 22171500 Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com> Signed-off-by: Guru Anbalagane <guru.anbalagane@oracle.com>
V2: The original version of the patch did not correctly handle placeholder
entries before the range to be deleted. The new check is more specific
and only matches placeholders at the start of range.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit fd2e0def3e0954be0453b625ce12c48e4a83bc70) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Dmitry identified a potential memory leak in the routine region_chg, where
a region descriptor is not freed on an error path.
However, the root cause for the above memory leak resides in region_del.
In this specific case, a "placeholder" entry is created in region_chg.
The associated page allocation fails, and the placeholder entry is left in
the reserve map. This is "by design" as the entry should be deleted when
the map is released. The bug is in the region_del routine which is used
to delete entries within a specific range (and when the map is released).
region_del did not handle the case where a placeholder entry exactly
matched the start of the range to be deleted. In this case, the
entry would not be deleted and leaked. The fix is to take these special
placeholder entries into account in region_del.
The region_chg error path leak is also fixed.
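A minimal sketch of the skip test implied by the description above. The
struct and function names are hypothetical, and the model assumes an entry
covers [from, to) with from == to marking a placeholder; it is not the
actual region_del code.

```c
#include <stdbool.h>

/* An entry covers [from, to); from == to marks a placeholder entry. */
struct sketch_region {
    long from;
    long to;
};

/* Return true if the entry lies entirely before a delete range starting
 * at f and can be skipped. A placeholder sitting exactly at f is not
 * skipped, so it is deleted together with the range. */
static bool skip_before_range(const struct sketch_region *rg, long f)
{
    bool placeholder_at_start = (rg->from == rg->to) && (rg->to == f);

    return rg->to <= f && !placeholder_at_start;
}
```

A placeholder located before the start of the range is still skipped; only
the one that exactly matches the start is treated as part of the range.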
Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 8ea87f6c208bd070a2701b53d2479a415f4159f4) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
port from RH. https://lkml.org/lkml/2015/7/9/724
[fs] vfs: Prevent syncing frozen file system (Lukas Czerner)
[12434041241791]
port from RH. https://lkml.org/lkml/2015/7/9/487
From Lukas Czerner <>
Subject [RFC][PATCH] fs: Prevent syncing frozen file system
Date Thu, 9 Jul 2015 19:45:45 +0200
Currently we can end up in a deadlock because of broken
sb_start_write -> s_umount ordering.
The race goes like this:
- write the file
- unlink the file - iput_final() will not be called as the file is still open
- freeze the file system
- Now simultaneously close the file and call sync (or syncfs on that
particular file system). Sync will get to wait_sb_inodes() where it will
grab the reference to the inode (__iget()) and later call iput().
If we manage to close the file and drop the reference in between those
calls sync will attempt to do an iput_final() because the inode is now
unlinked and we're holding the last reference to it. This will
however block on a frozen file system (ext4_delete_inode for
example).
Note that I've not been able to reproduce the issue, I've only seen this
happen once. However, with some instrumentation (like msleep() in
wait_sb_inodes()) it can be achieved.
Fix this by properly doing sb_start_write/sb_end_write to prevent us
from fsfreeze.
Note that with this patch syncfs will block on the frozen file system
which is probably ok, but sync will block if any file system happens to
be frozen - not sure if that's a problem, but it's certainly different
from what we've been used to.
Commit 8f1eb48758aa ("ocfs2: fix umask ignored issue") introduced an issue:
the SGID of a subdirectory was not inherited from its parent directory. This
is because SGID is set in "inode->i_mode" in ocfs2_get_init_inode(), but is
later overwritten by "mode", which does not have SGID set.
Jerry Snitselaar [Thu, 31 Oct 2013 03:03:21 +0000 (20:03 -0700)]
kbuild: Set objects.builtin dependency to bzImage for CONFIG_CTF
For x86 architecture use dependency on bzImage target instead of
vmlinux, otherwise the linux_banner in the debug vmlinux and the
vmlinuz that is shipped are different because vmlinux is now getting
rebuilt for ctf.
If the macaddr is not from Open Firmware or IDPROM (i.e., the default
macaddr was used) then do not call i40e_macaddr_init again, else
you will get a driver init failure like this:
Todd Vierling [Tue, 1 Dec 2015 20:59:59 +0000 (15:59 -0500)]
dtrace: ensure return value of access_process_vm() is > 0
If access_process_vm() returns a value <= 0, the loop intended to replace
NULs with whitespace in the ps string could turn into a memory sprayer
by incrementing its loop counter all the way to (uintptr_t)-1. This
often shows up as crashes with PTEs showing 0x20 stuffed inside, e.g.:
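A minimal user-space sketch of the guard named in the commit title. The
function name, its signature, and the exact loop bounds are assumptions for
illustration; len models the value returned by access_process_vm().

```c
/* Replace NUL separators in the first len bytes of buf with spaces.
 * len models the return value of access_process_vm(); a value <= 0 means
 * nothing usable was copied, so the buffer is left untouched. */
static void psargs_fixup_sketch(char *buf, int len)
{
    if (len <= 0)
        return;

    /* Leave the final byte alone so the string stays NUL-terminated. */
    for (int i = 0; i < len - 1; i++) {
        if (buf[i] == '\0')
            buf[i] = ' ';
    }
}
```

Checking the length before entering the loop keeps a zero or negative
return value from ever being used as a loop bound.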
Jamie Iles [Thu, 12 Nov 2015 12:29:25 +0000 (12:29 +0000)]
ksplice: correctly clear garbage on signal handling.
The test for _TIF_SIGPENDING was inverted, and so the stack was being
cleared when there were no signals pending rather than when signals were pending.
Correctly test _TIF_SIGPENDING so that the freezer can be used to clear
the stack of garbage when applying Ksplice updates.
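The corrected test can be sketched as below. SKETCH_TIF_SIGPENDING and its
bit position are hypothetical, as is the helper name; the point is only the
direction of the check.

```c
#include <stdbool.h>

#define SKETCH_TIF_SIGPENDING (1UL << 2)   /* hypothetical bit position */

/* Clear the stack only when a signal is pending. The inverted form,
 * !(thread_flags & SKETCH_TIF_SIGPENDING), would clear it in exactly
 * the opposite case. */
static bool should_clear_stack(unsigned long thread_flags)
{
    return (thread_flags & SKETCH_TIF_SIGPENDING) != 0;
}
```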
Santosh Shilimkar [Thu, 26 Nov 2015 04:08:58 +0000 (20:08 -0800)]
Merge branch 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek:
Revert "nfs: take extra reference to fl->fl_file when running a LOCKU operation"
Todd Vierling [Mon, 23 Nov 2015 15:08:44 +0000 (10:08 -0500)]
NFSoRDMA: for local permissions, pull lkey value from the correct ib_mr
When not using local_dma_lkey, fmr_op_open() fetches an ib_mr associated
with local write permission. This value is stuffed into the returned
rpcrdma_ia, but ia->ri_dma_mr->lkey was accessed _before_ the ia->ri_dma_mr
was updated.
Fix this by pulling lkey directly from the newly fetched ib_mr rather than
the value previously stored in ia->ri_dma_mr.
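The ordering fix can be illustrated with a small sketch. The sketch_mr and
sketch_ia structs are hypothetical reductions of ib_mr and rpcrdma_ia to the
one field each that matters here, and open_sketch() is not the real
fmr_op_open().

```c
/* Hypothetical reductions of ib_mr and rpcrdma_ia. */
struct sketch_mr {
    unsigned int lkey;
};

struct sketch_ia {
    struct sketch_mr *ri_dma_mr;
};

/* Return the lkey to use for local permissions. It is read from the
 * newly fetched MR, not from ia->ri_dma_mr, which may still point at a
 * stale (or not yet set) MR at this point. */
static unsigned int open_sketch(struct sketch_ia *ia,
                                struct sketch_mr *fetched)
{
    unsigned int lkey = fetched->lkey;

    ia->ri_dma_mr = fetched;
    return lkey;
}
```

Reading from the freshly fetched object makes the result independent of
when ia->ri_dma_mr is updated.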
Santosh Shilimkar [Fri, 20 Nov 2015 16:34:23 +0000 (08:34 -0800)]
Merge branch 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek:
uek-rpm: builds: Enable kabi check
uek-rpm: builds: generate module kabi files
uek-rpm: builds: add kabi whitelist debug version
Santosh Shilimkar [Fri, 20 Nov 2015 02:03:03 +0000 (18:03 -0800)]
Merge branch 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek:
uek-rpm: configs: change the x86_64 default governor to ondemand
uek-rpm: configs: sync up the EFIVAR_FS between ol6 and ol7
uek-rpm: use the latest 0.5 version of linux-firmware
Santosh Shilimkar [Fri, 20 Nov 2015 02:02:55 +0000 (18:02 -0800)]
Merge branch 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek:
mm-hugetlbfs-fix-bugs-in-fallocate-hole-punch-of-areas-with-holes-v3
mm/hugetlbfs: fix bugs in fallocate hole punch of areas with holes
btrfs: Print Warning only if ENOSPC_DEBUG is enabled
rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats
Santosh Shilimkar [Fri, 20 Nov 2015 02:02:46 +0000 (18:02 -0800)]
Merge branch 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek:
pci: Limit VPD length for megaraid_sas adapter
KABI Padding to allow future extensions
Santosh Shilimkar [Fri, 20 Nov 2015 02:02:39 +0000 (18:02 -0800)]
Merge branch 'topic/uek-4.1/dtrace' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/dtrace' of git://ca-git.us.oracle.com/linux-uek:
dtrace: fire proc:::signal-send for queued signals too
dtrace: correct signal-handle probe semantics
dtrace: remove trailing space in psargs
V3:
Add more descriptive comments and minor improvements as suggested by
Naoya Horiguchi
V2:
Make remove_inode_hugepages simpler after verifying truncate can not
race with page faults here.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d4d7420e86e29267ff2e20226cbcadcf7e6eab8c) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Hugh Dickins pointed out problems with the new hugetlbfs fallocate hole
punch code. These problems are in the routine remove_inode_hugepages and
mostly occur in the case where there are holes in the range of pages to be
removed. These holes could be the result of a previous hole punch or
simply sparse allocation. The current code could access pages outside the
specified range.
remove_inode_hugepages handles both hole punch and truncate operations.
Page index handling was fixed/cleaned up so that the loop index always
matches the page being processed. The code now only makes a single pass
through the range of pages as it was determined page faults could not race
with truncate. A cond_resched() was added after removing up to
PAGEVEC_SIZE pages.
Some totally unnecessary code in hugetlbfs_fallocate() that remained from
early development was also removed.
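A minimal sketch of a single pass over a sparse range, under assumptions:
the present[] array is a hypothetical sorted list of page indices that
actually exist, SKETCH_BATCH is a small stand-in for PAGEVEC_SIZE, and the
yields counter marks where a cond_resched() would be placed. It is not the
remove_inode_hugepages implementation.

```c
#include <stddef.h>

#define SKETCH_BATCH 2   /* hypothetical small stand-in for PAGEVEC_SIZE */

/* Walk the sorted list of present page indices once, removing those in
 * [start, end). Holes are simply absent from the list, and nothing
 * outside the range is touched. Returns the number of pages removed;
 * *yields counts the points where a reschedule would be offered. */
static size_t remove_range_sketch(const unsigned long *present, size_t n,
                                  unsigned long start, unsigned long end,
                                  size_t *yields)
{
    size_t removed = 0;
    size_t in_batch = 0;

    *yields = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long index = present[i];   /* always the page in hand */

        if (index < start)
            continue;
        if (index >= end)
            break;

        removed++;
        if (++in_batch == SKETCH_BATCH) {
            in_batch = 0;
            (*yields)++;   /* cond_resched() would go here */
        }
    }
    return removed;
}
```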
Tested with fallocate tests submitted here:
http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
And, some ftruncate tests under development
Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> [4.3] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ef9a2b7a46755b6b2d4ab522c2ffa53c6e1a0729) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>