www.infradead.org Git - users/jedix/linux-maple.git/log

btrfs: fix deadlock with extent-same and readpage

->readpage() does page_lock() before extent_lock(), we do the opposite in
extent-same. We want to reverse the order in btrfs_extent_same() but it's
not quite straightforward since the page locks are taken inside btrfs_cmp_data().

So I split btrfs_cmp_data() into 3 parts with a small context structure that
is passed between them. The first, btrfs_cmp_data_prepare() gathers up the
pages needed (taking page lock as required) and puts them on our context
structure. At this point, we are safe to lock the extent range. Afterwards,
we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free()
to clean up our context.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit f441460202cb787c49963bcc1f54cb48c52f7512)

Orabug: 26251039

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: John Haxby <john.haxby@oracle.com>

btrfs: pass unaligned length to btrfs_cmp_data()

In the case that we dedupe the tail of a file, we might expand the dedupe
len out to the end of our last block. We don't want to compare data past
i_size however, so pass the original length to btrfs_cmp_data().

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit 207910ddeeda38fd54544d94f8c8ca5a9632cc25)

Orabug: 26251039

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: John Haxby <john.haxby@oracle.com>

xen: fix bio vec merging

The current test for bio vec merging is not fully accurate and can be
tricked into merging bios when certain grant combinations are used.
The result of these malicious bio merges is a bio that extends past
the memory page used by any of the originating bios.

Take into account the following scenario, where a guest creates two
grant references that point to the same mfn, ie: grant 1 -> mfn A,
grant 2 -> mfn A.

These references are then used in a PV block request, and mapped by
the backend domain, thus obtaining two different pfns that point to
the same mfn, pfn B -> mfn A, pfn C -> mfn A.

If those grants happen to be used in two consecutive sectors of a disk
IO operation becoming two different bios in the backend domain, the
checks in xen_biovec_phys_mergeable will succeed, because bfn1 == bfn2
(they both point to the same mfn). However due to the bio merging,
the backend domain will end up with a bio that expands past mfn A into
mfn A + 1.

Fix this by making sure the check in xen_biovec_phys_mergeable takes
into account the offset and the length of the bio, this basically
replicates whats done in __BIOVEC_PHYS_MERGEABLE using mfns (bus
addresses). While there also remove the usage of
__BIOVEC_PHYS_MERGEABLE, since that's already checked by the callers
of xen_biovec_phys_mergeable.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Orabug: 26564183
CVE: CVE-2017-12134

Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>

ovl: verify upper dentry in ovl_remove_and_whiteout()

Orabug: 26175588

The upper dentry may become stale before we call ovl_lock_rename_workdir.
For example, someone could (mistakenly or maliciously) manually unlink(2)
it directly from upperdir.

To ensure it is not stale, let's lookup it after ovl_lock_rename_workdir
and and check if it matches the upper dentry.

Essentially, it is the same problem and similar solution as in
commit 11f3710417d0 ("ovl: verify upper dentry before unlink and rename").

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
(cherry picked from commit cfc9fde0b07c3b44b570057c5f93dda59dca1c94)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

ovl: use O_LARGEFILE in ovl_copy_up()

Orabug: 26619890

Open the lower file with O_LARGEFILE in ovl_copy_up().

Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for
catching 32-bit userspace dealing with a file large enough that it'll be
mishandled if the application isn't aware that there might be an integer
overflow. Inside the kernel, there shouldn't be any problems.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # v3.18+
(cherry picked from commit 0480334fa60488d12ae101a02d7d9e1a3d03d7dd)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

xfs: toggle readonly state around xfs_log_mount_finish

When we do log recovery on a readonly mount, unlinked inode
processing does not happen due to the readonly checks in
xfs_inactive(), which are trying to prevent any I/O on a
readonly mount.

This is misguided - we do I/O on readonly mounts all the time,
for consistency; for example, log recovery. So do the same
RDONLY flag twiddling around xfs_log_mount_finish() as we
do around xfs_log_mount(), for the same reason.

This all cries out for a big rework but for now this is a
simple fix to an obvious problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Orabug: 26630113

Link: https://patchwork.kernel.org/patch/9857173/
move readonly checking into the if statement to apply to the current kernel

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

xfs: write unmount record for ro mounts

There are dueling comments in the xfs code about intent
for log writes when unmounting a readonly filesystem.

In xfs_mountfs, we see the intent:

/*
* Now the log is fully replayed, we can transition to full read-only
* mode for read-only mounts. This will sync all the metadata and clean
* the log so that the recovery we just performed does not have to be
* replayed again on the next mount.
*/

and it calls xfs_quiesce_attr(), but by the time we get to
xfs_log_unmount_write(), it returns early for a RDONLY mount:

* Don't write out unmount record on read-only mounts.

Because of this, sequential ro mounts of a filesystem with
a dirty log will replay the log each time, which seems odd.

Fix this by writing an unmount record even for RO mounts, as long
as norecovery wasn't specified (don't write a clean log record
if a dirty log may still be there!) and the log device is
writable.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Orabug: 26630113

Link: https://patchwork.kernel.org/patch/9857169
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

xfs: fix eofblocks race with file extending async dio writes

It's possible for post-eof blocks to end up being used for direct I/O
writes. dio write performs an upfront unwritten extent allocation, sends
the dio and then updates the inode size (if necessary) on write
completion. If a file release occurs while a file extending dio write is
in flight, it is possible to mistake the post-eof blocks for speculative
preallocation and incorrectly truncate them from the inode. This means
that the resulting dio write completion can discover a hole and allocate
new blocks rather than perform unwritten extent conversion.

This requires a strange mix of I/O and is thus not likely to reproduce
in real world workloads. It is intermittently reproduced by generic/299.
The error manifests as an assert failure due to transaction overrun
because the aforementioned write completion transaction has only
reserved enough blocks for btree operations:

XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
file: fs/xfs//xfs_trans.c, line: 309

The root cause is that xfs_free_eofblocks() uses i_size to truncate
post-eof blocks from the inode, but async, file extending direct writes
do not update i_size until write completion, long after inode locks are
dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
to the incorrect size.

Update xfs_free_eofblocks() to serialize against dio similar to how
extending writes are serialized against i_size updates before post-eof
block zeroing. Specifically, wait on dio while under the iolock. This
ensures that dio write completions have updated i_size before post-eof
blocks are processed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Orabug: 26588811

(backport upstream commit e4229d6b0bc9280f29624faf170cf76a9f1ca60e)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

NVMe: Allocate queues only for online cpus

The driver previously requested allocating queues for the total possible
number of CPUs so that blk-mq could rebalance these if CPUs were added
after initialization. The number of hardware contexts can now be changed
at runtime, so we only need to allocate the number of online queues
since we can add more later.

Suggested-by: Jeff Lien <jeff.lien@hgst.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 24675382
(cherry picked from commit 2800b8e7d9dfca1fd9d044dcf7a046b5de5a7239)
Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com>
Conflicts:
drivers/nvme/host/pci.c

Reviewed-by: Govinda Tatti<Govinda.Tatti@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

KEYS: fix dereferencing NULL payload with nonzero length

sys_add_key() and the KEYCTL_UPDATE operation of sys_keyctl() allowed a
NULL payload with nonzero length to be passed to the key type's
->preparse(), ->instantiate(), and/or ->update() methods. Various key
types including asymmetric, cifs.idmap, cifs.spnego, and pkcs7_test did
not handle this case, allowing an unprivileged user to trivially cause a
NULL pointer dereference (kernel oops) if one of these key types was
present. Fix it by doing the copy_from_user() when 'plen' is nonzero
rather than when '_payload' is non-NULL, causing the syscall to fail
with EFAULT as expected when an invalid buffer is specified.

Cc: stable@vger.kernel.org # 2.6.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Orabug: 26591890
(cherry picked from commit 5649645d725c73df4302428ee4e02c869248b4c5)
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>

IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets

In MLX qp packets, the LRH (built by the driver) has both a VL field
and an SL field. When building a QP1 packet, the VL field should
reflect the SLtoVL mapping and not arbitrarily contain zero (as is
done now). This bug causes credit problems in IB switches at
high rates of QP1 packets.

The fix is to cache the SL to VL mapping in the driver, and look up
the VL mapped to the SL provided in the send request when sending
QP1 packets.

For FW versions which support generating a port_management_config_change
event with subtype sl-to-vl-table-change, the driver uses that event
to update its sl-to-vl mapping cache.  Otherwise, the driver snoops
incoming SMP mads to update the cache.

There remains the case where the FW is running in secure-host mode
(so no QP0 packets are delivered to the driver), and the FW does not
generate the sl2vl mapping change event. To support this case, the
driver updates (via querying the FW) its sl2vl mapping cache when
running in secure-host mode when it receives either a Port Up event
or a client-reregister event (where the port is still up, but there
may have been an opensm failover).
OpenSM modifies the sl2vl mapping before Port Up and Client-reregister
events occur, so if there is a mapping change the driver's cache will
be properly updated.

Fixes: 225c7b1feef1 ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
[ Original fix ported from upstream commit fd10ed8 modified with
  changes from merge commit b9044ac and modifications in
  dump_dev_cap_flags2() likely to be made upstream for correctness
  (and verified with private communication with Mellanox) ]

Orabug: 26198210

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

Revert "IB/mlx4: Suppress warning for not handled portmgmt event subtype"

This reverts commit cef258494a818e96db0e4ed40ddd2c52e6084846.

Revert needed to add code for proper handling of the port
management change event in a subsequent commit (instead
of a workaround that suppresses an error message for
an unhandled event).

Orabug: 26198210

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

scsi: megaraid_sas: handle dma_addr_t right on 32-bit

Orabug: 26608922

When building with a dma_addr_t that is different from pointer size, we
get this warning:

drivers/scsi/megaraid/megaraid_sas_fusion.c: In function 'megasas_make_prp_nvme':
drivers/scsi/megaraid/megaraid_sas_fusion.c:1654:17: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]

It's better to not pretend that the dma address is a pointer and instead
use a dma_addr_t consistently.

Fixes: 33203bc4d61b ("scsi: megaraid_sas: NVME fast path io support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sumit Saxena <sumit.saxena@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d1da522fb8a70b8c527d4ad15f9e62218cc00f2c)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

scsi: megaraid_sas: NVME fast path io support

Orabug: 26608922

This patch provide true fast path IO support. Driver creates PRP for
NVME drives and send Fast Path for performance. Certain h/w requirement
needs to be taken care in driver.

Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 33203bc4d61b33f1f7bb736eac0c6fdd20b92397)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/scsi/megaraid/megaraid_sas.h
drivers/scsi/megaraid/megaraid_sas_fp.c

scsi: megaraid_sas: NVME interface target prop added

Orabug: 26608922

This patch fetch true values of NVME property from FW using New DCMD
interface MR_DCMD_DEV_GET_TARGET_PROP

Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 96188a89cc6d5ad3a0a5b7a6c4abc9f4a6de7678)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

scsi: megaraid_sas: NVME Interface detection and prop settings

Orabug: 26608922

Adding detection logic for NVME device attached behind Ventura
controller.  Driver set HostPageSize in IOC_INIT frame to inform about
page size for NVME devices.  Firmware reports NVME page size to the
driver.  PD INFO DCMD provide new interface type NVME_PD. Driver set
property of NVME device.

Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit 15dd03811d99dcf828f4eeb2c2b6a02ddc1201c7 ]
[ Resolved the use of blk_queue_virt_boundary just like broadcom driver version ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/scsi/megaraid/megaraid_sas.h

scsi: megaraid_sas: Use synchronize_irq to wait for IRQs to complete

Orabug: 26608922

FIX - Do not use random delay to synchronize with IRQ. Use kernel API.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit 29206da1490a7065e8a03ec43f6de60c5c978cae ]
[ Added drivers/scsi/megaraid/kcompat.h to resolve pci_irq_vector symbol ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

fs/fuse: fuse mount can cause panic with no memory numa node

Orabug: 26591421

Host panics with the below stack trace:-

[ 2509.835362] BUG: unable to handle kernel paging request at
0000000000001e08
[ 2509.843155] IP: [<ffffffff8119e1a7>]
__alloc_pages_nodemask+0xb7/0xa10
[ 2509.850466] PGD 7f5bbf5067 PUD 7f5c492067 PMD 0
..
..
[ 2510.108923] Call Trace:
[ 2510.111658]  [<ffffffff81022075>] ? native_sched_clock+0x35/0xa0
[ 2510.118360]  [<ffffffff810220e9>] ? sched_clock+0x9/0x10
[ 2510.124291]  [<ffffffff810b9ad5>] ? sched_clock_cpu+0x85/0xc0
[ 2510.130692]  [<ffffffff810b6227>] ? try_to_wake_up+0x47/0x370
[ 2510.137108]  [<ffffffff811f0c43>] ? deactivate_slab+0x383/0x400
[ 2510.143712]  [<ffffffff811f1e37>] new_slab+0xa7/0x460
[ 2510.149350]  [<ffffffff81733cc0>] __slab_alloc+0x317/0x477
[ 2510.155473]  [<ffffffffa0380336>] ? fuse_conn_init+0x236/0x420 [fuse]
[ 2510.162664]  [<ffffffff8146184c>] ? extract_entropy+0x14c/0x330
[ 2510.169260]  [<ffffffff814622b0>] ? get_random_bytes+0x40/0xa0
[ 2510.175769]  [<ffffffff811f2c75>] kmem_cache_alloc_node_trace+0x95/0x2b0
[ 2510.183248]  [<ffffffffa0380336>] fuse_conn_init+0x236/0x420 [fuse]
[ 2510.190241]  [<ffffffffa03808c0>] fuse_fill_super+0x3a0/0x700 [fuse]
[ 2510.197330]  [<ffffffff811f30f6>] ? __kmalloc+0x266/0x2c0
[ 2510.203355]  [<ffffffff811a6dcc>] ? register_shrinker+0x3c/0xa0
[ 2510.209961]  [<ffffffff81216728>] ? sget+0x3b8/0x3f0
[ 2510.215500]  [<ffffffff81215880>] ? get_anon_bdev+0x120/0x120
[ 2510.221911]  [<ffffffffa0380520>] ? fuse_conn_init+0x420/0x420 [fuse]
[ 2510.229096]  [<ffffffff8121686d>] mount_nodev+0x4d/0xb0
[ 2510.234927]  [<ffffffffa037ec38>] fuse_mount+0x18/0x20 [fuse]
[ 2510.241337]  [<ffffffff812173e9>] mount_fs+0x39/0x180
[ 2510.246975]  [<ffffffff8123366b>] vfs_kern_mount+0x6b/0x110
[ 2510.253191]  [<ffffffff81236411>] do_mount+0x251/0xcf0
[ 2510.258922]  [<ffffffff812371f2>] SyS_mount+0xa2/0x110
[ 2510.264653]  [<ffffffff81028666>] ? syscall_trace_leave+0xc6/0x150
[ 2510.271551]  [<ffffffff8173d96e>] system_call_fastpath+0x12/0x71

This commit reverts the changes done in 937287c2e4ab ("fs/fuse: Fix for
correct number of numa nodes") and fixes the original problem by calling
kmalloc_node/kmem_cache_alloc_node, only on the online numa nodes. For the
offline nodes, kmalloc or kmem_cache_alloc will be used for allocation.
This problem happens only with slub allocator because it does not do
fallback allocation if node is offline, unlike slab. And for that reason,
the same behavior is emulated here.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Ashish Samant<ashish.samant@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>

Fix regression which breaks DFS mounting

Orabug: 26591404

Patch a6b5058 results in -EREMOTE returned by is_path_accessible() in
cifs_mount() to be ignored which breaks DFS mounting.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
(cherry picked from commit d171356ff11ab1825e456dfb979755e01b3c54a1)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ol7/spec: sync up linux-firmware version for ol74

Orabug: 26567308
Orabug: 26567283

Because linux-firmware package was updated to version
20170803-56.git7d2c913d.0.1
We need to sync up the version required by QU5 and QU6
kernel.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

ol6/spec: sync up linux-firmware version for ol6

Orabug: 26586911
Orabug: 26586927

Because linux-firmware package was updated to version
20170803-56.git7d2c913d.0.1
We need to sync up the version required by QU5 and QU6
kernel.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

scsi: scsi_debug: Avoid PI being disabled when TPGS is enabled

It was not possible to enable both T10 PI and TPGS because they share
the same byte in the INQUIRY response. Logically OR the TPGS value
instead of using assignment.

Orabug: 25704090

Reported-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 70bdf2026d905b8bfa0a455d35018df3e9777a6c)
Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Conflicts:
drivers/scsi/scsi_debug.c

config: enable QEDI config items for UEK

Orabug: 26519978

Enable QEDI driver in UEK config files

Signed-off-by: Brian Maly <brian.maly@oracle.com>

qedi: add firmware related dependencies

Orabug: 26519978

Add some firmware related code qedi driver needs to build properly

Signed-off-by: Brian Maly <brian.maly@oracle.com>

config: enable QED config items for UEK

Orabug: 26519978

Make QED related config items consistent between all x86 config files.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: qedi: fix KAPI for UEK kernels

Orabug: 26519978

There are some minor differences between our kernels KAPI and what
the vendors patchset expected. Fixup KAPI for UEK4 kernels.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

nfsd: encoders mustn't use unitialized values in error cases

In error cases, lgp->lg_layout_type may be out of bounds; so we
shouldn't be using it until after the check of nfserr.

This was seen to crash nfsd threads when the server receives a LAYOUTGET
request with a large layout type.

GETDEVICEINFO has the same problem.

Reported-by: Ari Kauppi <Ari.Kauppi@synopsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit f961e3f2acae94b727380c0b74e2d3954d0edf79)

Orabug: 26376568
CVE: CVE-2017-8797

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

nfsd: fix undefined behavior in nfsd4_layout_verify

UBSAN: Undefined behaviour in fs/nfsd/nfs4proc.c:1262:34
shift exponent 128 is too large for 32-bit type 'int'

Depending on compiler+architecture, this may cause the check for
layout_type to succeed for overly large values (which seems to be the
case with amd64). The large value will be later used in de-referencing
nfsd4_layout_ops for function pointers.

Reported-by: Jani Tuovila <tuovila@synopsys.com>
Signed-off-by: Ari Kauppi <ari@synopsys.com>
[colin.king@canonical.com: use LAYOUT_TYPE_MAX instead of 32]
Cc: stable@vger.kernel.org
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit b550a32e60a4941994b437a8d662432a486235a5)

Orabug: 26376568
CVE: CVE-2017-8797

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
fs/nfsd/nfs4proc.c

ovl: fix workdir creation

Workdir creation fails in latest kernel.

Fix by allowing EOPNOTSUPP as a valid return value from
vfs_removexattr(XATTR_NAME_POSIX_ACL_*). Upper filesystem may not support
ACL and still be perfectly able to support overlayfs.

Reported-by: Martin Ziegler <ziegler@uni-freiburg.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: c11b9fdd6a61 ("ovl: remove posix_acl_default from workdir")
Cc: <stable@vger.kernel.org>
Orabug: 26401569

(backport upstream commit e1ff3dd1ae52cef5b5373c8cc4ad949c2c25a71c)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: listxattr: use strnlen()

Be defensive about what underlying fs provides us in the returned xattr
list buffer. If it's not properly null terminated, bail out with a warning
insead of BUG.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Orabug: 26401569

(backport upstream commit 7cb35119d067191ce9ebc380a599db0b03cbd9d9)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: copyattr after setting POSIX ACL

Setting POSIX acl may also modify the file mode, so need to copy that up to
the overlay inode.

Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: d837a49bd57f ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit ce31513a9114f74fe3e9caa6534d201bdac7238d)
call d_inode to get inode from dentry.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: don't cache acl on overlay layer

Some operations (setxattr/chmod) can make the cached acl stale.  We either
need to clear overlay's acl cache for the affected inode or prevent acl
caching on the overlay altogether.  Preventing caching has the following
advantages:

- no double caching, less memory used

- overlay cache doesn't go stale when fs clears it's own cache

Possible disadvantage is performance loss.  If that becomes a problem
get_acl() can be optimized for overlayfs.

This patch disables caching by pre setting i_*acl to a value that

  - has bit 0 set, so is_uncached_acl() will return true

  - is not equal to ACL_NOT_CACHED, so get_acl() will not overwrite it

The constant -3 was chosen for this purpose.

Fixes: 39a25b2b3762 ("ovl: define ->get_acl() for overlay inodes")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 2a3a2a3f35249412e35fbb48b743348c40373409)
use ACL_NOT_CACHED instead of ACL_DONT_CACHE since the is_uncached_acl
is not available in the current kernel.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: use cached acl on underlying layer

Instead of calling ->get_acl() directly, use get_acl() to get the cached
value.

We will have the acl cached on the underlying inode anyway, because we do
permission checking on the both the overlay and the underlying fs.

So, since we already have double caching, this improves performance without
any cost.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 5201dc449e4b6b6d7e92f7f974269b11681f98b5)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: remove posix_acl_default from workdir

Clear out posix acl xattrs on workdir and also reset the mode after
creation so that an inherited sgid bit is cleared.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Orabug: 26401569

(backport upstream commit c11b9fdd6a612f376a5e886505f1c54c16d8c380)
change inode_lock to mutex_lock to match the lock usage in the current
kernel.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: handle umask and posix_acl_default correctly on creation

Setting MS_POSIXACL in sb->s_flags has the side effect of passing mode to
create functions without masking against umask.

Another problem when creating over a whiteout is that the default posix acl
is not inherited from the parent dir (because the real parent dir at the
time of creation is the work directory).

Fix these problems by:

a) If upper fs does not have MS_POSIXACL, then mask mode with umask.

b) If creating over a whiteout, call posix_acl_create() to get the
inherited acls. After creation (but before moving to the final
destination) set these acls on the created file. posix_acl_create() also
updates the file creation mode as appropriate.

Fixes: 39a25b2b3762 ("ovl: define ->get_acl() for overlay inodes")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 38b256973ea90fc7c2b7e1b734fa0e8b83538d50)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: change inode_lock to mutex_lock

Orabug: 26401569

Change the lock type in order to back port the patch to the current
kernel which uses mutex_lock instead of inode_lock.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: proper cleanup of workdir

When mounting overlayfs it needs a clean "work" directory under the
supplied workdir.

Previously the mount code removed this directory if it already existed and
created a new one. If the removal failed (e.g. directory was not empty)
then it fell back to a read-only mount not using the workdir.

While this has never been reported, it is possible to get a non-empty
"work" dir from a previous mount of overlayfs in case of crash in the
middle of an operation using the work directory.

In this case the left over state should be discarded and the overlay
filesystem will be consistent, guaranteed by the atomicity of operations on
moving to/from the workdir to the upper layer.

This patch implements cleaning out any files left in workdir. It is
implemented using real recursion for simplicity, but the depth is limited
to 2, because the worst case is that of a directory containing whiteouts
under "work".

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Orabug: 26401569

(backport upstream commit eea2fb4851e9dcbab6b991aaf47e2e024f1f55a0)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: rename is_merge to is_lowest

The 'is_merge' is an historical naming from when only a single lower layer
could exist. With the introduction of multiple lower layers the meaning of
this flag was changed to mean only the "lowest layer" (while all lower
layers were being merged).

So now 'is_merge' is inaccurate and hence renaming to 'is_lowest'

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 56656e960b555cb98bc414382566dcb59aae99a2)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: cleanup unused var in rename2

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 6986c012faa480fb0fda74eaae9abb22f7ad1004)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: Switch to generic_getxattr

Now that overlayfs has xattr handlers for iop->{set,remove}xattr, use
those same handlers for iop->getxattr as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Orabug: 26401569

(backport upstream commit 0eb45fc3bb7a2cf9c9c93d9e95986a841e5f4625)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: Switch to generic_removexattr

Commit d837a49bd57f ("ovl: fix POSIX ACL setting") switches from
iop->setxattr from ovl_setxattr to generic_setxattr, so switch from
ovl_removexattr to generic_removexattr as well. As far as permission
checking goes, the same rules should apply in either case.

While doing that, rename ovl_setxattr to ovl_xattr_set to indicate that
this is not an iop->setxattr implementation and remove the unused inode
argument.

Move ovl_other_xattr_set above ovl_own_xattr_set so that they match the
order of handlers in ovl_xattr_handlers.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: d837a49bd57f ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Orabug: 26401569

(backport upstream commit 0e585ccc13b3edbb187fb4f1b7cc9397f17d64a9)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: Get rid of ovl_xattr_noacl_handlers array

Use an ordinary #ifdef to conditionally include the POSIX ACL handlers
in ovl_xattr_handlers, like the other filesystems do. Flag the code
that is now only used conditionally with __maybe_unused.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 0c97be22f928b85110504c4bbb8574facb4bd0c0)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: Fix OVL_XATTR_PREFIX

Make sure ovl_own_xattr_handler only matches attribute names starting
with "overlay.", not "overlayXXX".

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: d837a49bd57f ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Orabug: 26401569

(backport upstream commit fe2b75952347762a21f67d9df1199137ae5988b2)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: xattr filter fix

a) ovl_need_xattr_filter() is wrong, we can have multiple lower layers
overlaid, all of which (except the lowest one) honouring the
"trusted.overlay.opaque" xattr. So need to filter everything except the
bottom and the pure-upper layer.

b) we no longer can assume that inode is attached to dentry in
get/setxattr.

This patch unconditionally filters private xattrs to fix both of the above.
Performance impact for get/removexattrs is likely in the noise.

For listxattrs it might be measurable in pathological cases, but I very
much hope nobody cares. If they do, we'll fix it then.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: b96809173e94 ("security_d_instantiate(): move to the point prior to attaching dentry to inode")
Orabug: 26401569

(backport upstream commit b581755b1c565391c72d03b157ba2dd0b18e9d15)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix warnings caused by WRITE_ONCE

Orabug: 26401569

The commit 39b681f80(ovl: store real inode pointer in ->i_private)
adds a call to the WRITE_ONCE which generates "initialization makes pointer
from integer without a cast" warnings, fix it by coverting to the required type.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: simplify empty checking

The empty checking logic is duplicated in ovl_check_empty_and_clear() and
ovl_remove_and_whiteout(), except the condition for clearing whiteouts is
different:

ovl_check_empty_and_clear() checked for being upper

ovl_remove_and_whiteout() checked for merge OR lower

Move the intersection of those checks (upper AND merge) into
ovl_check_empty_and_clear() and simplify ovl_remove_and_whiteout().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 30c17ebfb2a11468fe825de19afa3934ee98bfd2)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

qstr: constify instances in overlayfs

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 29c42e80ba5b1e59a4f427b44e2bdebd347b9409)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: clear nlink on rmdir

To make delete notification work on fa/inotify.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit dbc816d05ddcfb189af8784d04fc84c812db3747)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: append MAY_READ when diluting write checks

Right now we remove MAY_WRITE/MAY_APPEND bits from mask if realfile is on
lower/. This is done as files on lower will never be written and will be
copied up. But to copy up a file, mounter should have MAY_READ permission
otherwise copy up will fail. So set MAY_READ in mask when MAY_WRITE is
reset.

Dan Walsh noticed this when he did access(lowerfile, W_OK) and it returned
True (context mounts) but when he tried to actually write to file, it
failed as mounter did not have permission on lower file.

[SzM] don't set MAY_READ if only MAY_APPEND is set without MAY_WRITE; this
won't trigger a copy-up.

Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 500cac3ccee65526d5075da3af2674101305bf8c)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: dilute permission checks on lower only if not special file

Right now if file is on lower/, we remove MAY_WRITE/MAY_APPEND bits from
mask as lower/ will never be written and file will be copied up. But this
is not true for special files. These files are not copied up and are opened
in place. So don't dilute the checks for these types of files.

Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit e29841a0ab3d03e77313abd8fb4c16e80fb26e29)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix POSIX ACL setting

Setting POSIX ACL needs special handling:

1) Some permission checks are done by ->setxattr() which now uses mounter's
creds ("ovl: do operations on underlying file system in mounter's
context"). These permission checks need to be done with current cred as
well.

2) Setting ACL can fail for various reasons. We do not need to copy up in
these cases.

In the mean time switch to using generic_setxattr.

[Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr()
doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is
disabled, and I could not come up with an obvious way to do it.

This instead avoids the link error by defining two sets of ACL operations
and letting the compiler drop one of the two at compile time depending
on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code,
also leading to smaller code.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit d837a49bd57f1ec2f6411efa829fecc34002b110)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: share inode for hard link

Inode attributes are copied up to overlay inode (uid, gid, mode, atime,
mtime, ctime) so generic code using these fields works correcty. If a hard
link is created in overlayfs separate inodes are allocated for each link.
If chmod/chown/etc. is performed on one of the links then the inode
belonging to the other ones won't be updated.

This patch attempts to fix this by sharing inodes for hard links.

Use inode hash (with real inode pointer as a key) to make sure overlay
inodes are shared for hard links on upper. Hard links on lower are still
split (which is not user observable until the copy-up happens, see
Documentation/filesystems/overlayfs.txt under "Non-standard behavior").

The inode is only inserted in the hash if it is non-directoy and upper.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: store real inode pointer in ->i_private

To get from overlay inode to real inode we currently use 'struct
ovl_entry', which has lifetime connected to overlay dentry. This is okay,
since each overlay dentry had a new overlay inode allocated.

Following patch will break that assumption, so need to leave out ovl_entry.
This patch stores the real inode directly in i_private, with the lowest bit
used to indicate whether the inode is upper or lower.

Lifetime rules remain, using ovl_inode_real() must only be done while
caller holds ref on overlay dentry (and hence on real dentry), or within
RCU protected regions.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 39b681f8026c170a73972517269efc830db0d7ce)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: permission: return ECHILD instead of ENOENT

The error is due to RCU and is temporary.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit a999d7e161a085e30181d0a88f049bd92112e172)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: update atime on upper

Fix atime update logic in overlayfs.

This patch adds an i_op->update_time() handler to overlayfs inodes.  This
forwards atime updates to the upper layer only.  No atime updates are done
on lower layers.

Remove implicit atime updates to underlying files and directories with
O_NOATIME.  Remove explicit atime update in ovl_readlink().

Clear atime related mnt flags from cloned upper mount.  This means atime
updates are controlled purely by overlayfs mount options.

Reported-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit d719e8f268fa4f9944b24b60814da9017dfb7787)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: convert inode_lock to mutex_lock

Orabug: 26401569

This patch is a fix to the conflict of the upstream commit bb0d2b8ad29
(ovl: fix sgid on directory), convert the lock type to match with the
lock usage of the current kernel.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix sgid on directory

When creating directory in workdir, the group/sgid inheritance from the
parent dir was omitted completely. Fix this by calling inode_init_owner()
on overlay inode and using the resulting uid/gid/mode to create the file.

Unfortunately the sgid bit can be stripped off due to umask, so need to
reset the mode in this case in workdir before moving the directory in
place.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit bb0d2b8ad29630b580ac903f989e704e23462357)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: simplify permission checking

The fact that we always do permission checking on the overlay inode and
clear MAY_WRITE for checking access to the lower inode allows cruft to be
removed from ovl_permission().

1) "default_permissions" option effectively did generic_permission() on the
overlay inode with i_mode, i_uid and i_gid updated from underlying
filesystem. This is what we do by default now. It did the update using
vfs_getattr() but that's only needed if the underlying filesystem can
change (which is not allowed). We may later introduce a "paranoia_mode"
that verifies that mode/uid/gid are not changed.

2) splitting out the IS_RDONLY() check from inode_permission() also becomes
unnecessary once we remove the MAY_WRITE from the lower inode check.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 9c630ebefeeee4363ffd29f2f9b18eddafc6479c)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: do not require mounter to have MAY_WRITE on lower

Now we have two levels of checks in ovl_permission(). overlay inode
is checked with the creds of task while underlying inode is checked
with the creds of mounter.

Looks like mounter does not have to have WRITE access to files on lower/.
So remove the MAY_WRITE from access mask for checks on underlying
lower inode.

This means task should still have the MAY_WRITE permission on lower
inode and mounter is not required to have MAY_WRITE.

It also solves the problem of read only NFS mounts being used as lower.
If __inode_permission(lower_inode, MAY_WRITE) is called on read only
NFS, it fails. By resetting MAY_WRITE, check succeeds and case of
read only NFS shold work with overlay without having to specify any
special mount options (default permission).

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 754f8cb72b42a3a6100d2bbb1cb885361a7310dd)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: do operations on underlying file system in mounter's context

Given we are now doing checks both on overlay inode as well underlying
inode, we should be able to do checks and operations on underlying file
system using mounter's context.

So modify all operations to do checks/operations on underlying dentry/inode
in the context of mounter.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 1175b6b8d96331676f1d436b089b965807f23b4a)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix uid/gid when creating over whiteout

Fix a regression when creating a file over a whiteout.  The new
file/directory needs to use the current fsuid/fsgid, not the ones from the
mounter's credentials.

The refcounting is a bit tricky: prepare_creds() sets an original refcount,
override_creds() gets one more, which revert_cred() drops.  So

  1) we need to expicitly put the mounter's credentials when overriding
     with the updated one

  2) we need to put the original ref to the updated creds (and this can
     safely be done before revert_creds(), since we'll still have the ref
     from override_creds()).

Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
Fixes: 3fe6e52f0626 ("ovl: override creds with the ones from the superblock mounter")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit d0e13f5bbe4be7c8f27736fc40503dcec04b7de0)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: modify ovl_permission() to do checks on two inodes

Right now ovl_permission() calls __inode_permission(realinode), to do
permission checks on real inode and no checks are done on overlay inode.

Modify it to do checks both on overlay inode as well as underlying inode.
Checks on overlay inode will be done with the creds of calling task while
checks on underlying inode will be done with the creds of mounter.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit c0ca3d70e8d3cf81e2255a217f7ca402f5ed0862)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: define ->get_acl() for overlay inodes

Now we are planning to do DAC permission checks on overlay inode
itself. And to make it work, we will need to make sure we can get acls from
underlying inode. So define ->get_acl() for overlay inodes and this in turn
calls into underlying filesystem to get acls, if any.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 39a25b2b37629f65e5a1eba1b353d0b47687c2ca)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: move some common code in a function

ovl_create_upper() and ovl_create_over_whiteout() seem to be sharing some
common code which can be moved into a separate function. No functionality
change.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 72e48481815eeca72fc886b3be91301ad87d6aeb)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: store ovl_entry in inode->i_private for all inodes

Previously this was only done for directory inodes. Doing so for all
inodes makes for a nice cleanup in ovl_permission at zero cost.

Inodes are not shared for hard links on the overlay, so this works fine.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 58ed4e70f253d80ed72faba7873dc11603b398bc)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: use generic_delete_inode

No point in keeping overlay inodes around since they will never be reused.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit eead4f2dc4f851a3790c49850e96a1d155bf5451)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: check mounter creds on underlying lookup

The hash salting changes meant that we can no longer reuse the hash in the
overlay dentry to look up the underlying dentry.

Instead of lookup_hash(), use lookup_one_len_unlocked() and swith to
mounter's creds (like we do for all other operations later in the series).

Now the lookup_hash() export introduced in 4.6 by 3c9fe8cdff1b ("vfs: add
lookup_hash() helper") is unused and can possibly be removed; its
usefulness negated by the hash salting and the idea that mounter's creds
should be used on operations on underlying filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 8387ff2577eb ("vfs: make the string hashes salt the hash")
Orabug: 26401569

(backport upstream commit c1b2cc1a765aff4df7b22abe6b66014236f73eba)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: ignore permissions on underlying lookup

Generally permission checking is not necessary when overlayfs looks up a
dentry on one of the underlying layers, since search permission on base
directory was already checked in ovl_permission().

More specifically using lookup_one_len() causes a problem when the lower
directory lacks search permission for a specific user while the upper
directory does have search permission. Since lookups are cached, this
causes inconsistency in behavior: success depends on who did the first
lookup.

So instead use lookup_hash() which doesn't do the permission check.

Reported-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 38b78a5f18584db6fa7441e0f4531b283b0e6725)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: override creds with the ones from the superblock mounter

In user namespace the whiteout creation fails with -EPERM because the
current process isn't capable(CAP_SYS_ADMIN) when setting xattr.

A simple reproducer:

$ mkdir upper lower work merged lower/dir
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
$ unshare -m -p -f -U -r bash

Now as root in the user namespace:

\# touch merged/dir/{1,2,3} # this will force a copy up of lower/dir
\# rm -fR merged/*

This ends up failing with -EPERM after the files in dir has been
correctly deleted:

unlinkat(4, "2", 0)                     = 0
unlinkat(4, "1", 0)                     = 0
unlinkat(4, "3", 0)                     = 0
close(4)                                = 0
unlinkat(AT_FDCWD, "merged/dir", AT_REMOVEDIR) = -1 EPERM (Operation not
permitted)

Interestingly, if you don't place files in merged/dir you can remove it,
meaning if upper/dir does not exist, creating the char device file works
properly in that same location.

This patch uses ovl_sb_creator_cred() to get the cred struct from the
superblock mounter and override the old cred with these new ones so that
the whiteout creation is possible because overlay is wrong in assuming that
the creds it will get with prepare_creds will be in the initial user
namespace.  The old cap_raise game is removed in favor of just overriding
the old cred struct.

This patch also drops from ovl_copy_up_one() the following two lines:

override_cred->fsuid = stat->uid;
override_cred->fsgid = stat->gid;

This is because the correct uid and gid are taken directly with the stat
struct and correctly set with ovl_set_attr().

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 3fe6e52f062643676eb4518d68cee3bc1272091b)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix dentry leak for default_permissions

When using the 'default_permissions' mount option, ovl_permission() on
non-directories was missing a dput(alias), resulting in "BUG Dentry still
in use".

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 8d3095f4ad47 ("ovl: default permissions")
Cc: <stable@vger.kernel.org> # v4.5+
Orabug: 26401569

(backport upstream commit a4859d75944a726533ab86d24bb5ffd1b2b7d6cc)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix open in stacked overlay

If two overlayfs filesystems are stacked on top of each other, then we need
recursion in ovl_d_select_inode().

I guess d_backing_inode() is supposed to do that. But currently it doesn't
and that functionality is open coded in vfs_open(). This is now copied
into ovl_d_select_inode() to fix this regression.

Reported-by: Alban Crequy <alban.crequy@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay...")
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
Orabug: 26401569

(backport upstream commit 1c8a47df36d72ace8cf78eb6c228aa0f8027d3c2)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

nfsd: don't hold i_mutex over userspace upcalls

We need information about exports when crossing mountpoints during
lookup or NFSv4 readdir.  If we don't already have that information
cached, we may have to ask (and wait for) rpc.mountd.

In both cases we currently hold the i_mutex on the parent of the
directory we're asking rpc.mountd about.  We've seen situations where
rpc.mountd performs some operation on that directory that tries to take
the i_mutex again, resulting in deadlock.

With some care, we may be able to avoid that in rpc.mountd.  But it
seems better just to avoid holding a mutex while waiting on userspace.

It appears that lookup_one_len is pretty much the only operation that
needs the i_mutex.  So we could just drop the i_mutex elsewhere and do
something like

mutex_lock()
lookup_one_len()
mutex_unlock()

In many cases though the lookup would have been cached and not required
the i_mutex, so it's more efficient to create a lookup_one_len() variant
that only takes the i_mutex when necessary.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 26401569

(backport upstream commit bbddca8e8fac07ece3938e03526b5d00fa791a4c)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

Revert "ixgbevf: get rid of custom busy polling code"

This reverts commit 1975e69c708706b84d9462ce7c0135d33310c28a. Performance regression,
because the net/core napi support is not present.

Orabug: 26494997
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>

Revert "ixgbe: get rid of custom busy polling code"

This reverts commit 9244251e4f45dc9a61dd094a5d7ba23bb0285a86. The core napi support is
not in place, we need to keep the driver support or performance suffers.

Orabug: 26494997
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>

ocfs2: fix deadlock caused by recursive locking in xattr

Orabug: 26427132

Another deadlock path caused by recursive locking is reported.  This
kind of issue was introduced since commit 743b5f1434f5 ("ocfs2: take
inode lock in ocfs2_iop_set/get_acl()").  Two deadlock paths have been
fixed by commit b891fa5024a9 ("ocfs2: fix deadlock issue when taking
inode lock at vfs entry points").  Yes, we intend to fix this kind of
case in incremental way, because it's hard to find out all possible
paths at once.

This one can be reproduced like this.  On node1, cp a large file from
home directory to ocfs2 mountpoint.  While on node2, run
setfacl/getfacl.  Both nodes will hang up there.  The backtraces:

On node1:
  __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
  ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
  ocfs2_write_begin+0x43/0x1a0 [ocfs2]
  generic_perform_write+0xa9/0x180
  __generic_file_write_iter+0x1aa/0x1d0
  ocfs2_file_write_iter+0x4f4/0xb40 [ocfs2]
  __vfs_write+0xc3/0x130
  vfs_write+0xb1/0x1a0
  SyS_write+0x46/0xa0

On node2:
  __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
  ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
  ocfs2_xattr_set+0x12e/0xe80 [ocfs2]
  ocfs2_set_acl+0x22d/0x260 [ocfs2]
  ocfs2_iop_set_acl+0x65/0xb0 [ocfs2]
  set_posix_acl+0x75/0xb0
  posix_acl_xattr_set+0x49/0xa0
  __vfs_setxattr+0x69/0x80
  __vfs_setxattr_noperm+0x72/0x1a0
  vfs_setxattr+0xa7/0xb0
  setxattr+0x12d/0x190
  path_setxattr+0x9f/0xb0
  SyS_setxattr+0x14/0x20

Fix this one by using ocfs2_inode_{lock|unlock}_tracker, which is
exported by commit 439a36b8ef38 ("ocfs2/dlmglue: prepare tracking logic
to avoid recursive cluster lock").

Link: http://lkml.kernel.org/r/20170622014746.5815-1-zren@suse.com
Fixes: 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Eric Ren <zren@suse.com>
Reported-by: Thomas Voegtle <tv@lio96.de>
Tested-by: Thomas Voegtle <tv@lio96.de>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherrypicked from commit 8818efaaacb78c60a9d90c5705b6c99b75d7d442)
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>

ocfs2: fix deadlock issue when taking inode lock at vfs entry points

Orabug: 26427132

Conflicts:
    fs/ocfs2/file.c

Commit 3acdc8b3862a results in a deadlock, as the author realized shortly
after the patch was merged.  The discussion happened here

https://oss.oracle.com/pipermail/ocfs2-devel/2015-September/011085.html

The reason why taking cluster inode lock at vfs entry points opens up a
self deadlock window, is explained in the previous patch of this series.

So far, we have seen two different code paths that have this issue.

1. do_sys_open
     may_open
       inode_permission
        ocfs2_permission
         ocfs2_inode_lock() <=== take PR
          generic_permission
           get_acl
            ocfs2_iop_get_acl
             ocfs2_inode_lock() <=== take PR

2. fchmod|fchmodat
    chmod_common
     notify_change
      ocfs2_setattr <=== take EX
       posix_acl_chmod
        get_acl
         ocfs2_iop_get_acl <=== remote PR request
        ocfs2_iop_set_acl <=== take EX

Fixes them by adding the tracking logic (in the previous patch) for these
funcs above, ocfs2_permission(), ocfs2_iop_[set|get]_acl(),
ocfs2_setattr().

Link: http://lkml.kernel.org/r/20170117100948.11657-3-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherrypicked from commit b891fa5024a95c77e0d6fd6655cb74af6fb77f46)
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>

ocfs2/dlmglue: prepare tracking logic to avoid recursive cluster lock

Orabug: 26427132

We are in the situation that we have to avoid recursive cluster locking,
but there is no way to check if a cluster lock has been taken by a precess
already.

Mostly, we can avoid recursive locking by writing code carefully.
However, we found that it's very hard to handle the routines that are
invoked directly by vfs code.  For instance:

  const struct inode_operations ocfs2_file_iops = {
      .permission     = ocfs2_permission,
      .get_acl        = ocfs2_iop_get_acl,
      .set_acl        = ocfs2_iop_set_acl,
  };

Both ocfs2_permission() and ocfs2_iop_get_acl() call ocfs2_inode_lock(PR):

  do_sys_open
   may_open
    inode_permission
     ocfs2_permission
      ocfs2_inode_lock() <=== first time
       generic_permission
        get_acl
         ocfs2_iop_get_acl
   ocfs2_inode_lock() <=== recursive one

A deadlock will occur if a remote EX request comes in between two of
ocfs2_inode_lock().  Briefly describe how the deadlock is formed:

On one hand, OCFS2_LOCK_BLOCKED flag of this lockres is set in
BAST(ocfs2_generic_handle_bast) when downconvert is started on behalf of
the remote EX lock request.  Another hand, the recursive cluster lock
(the second one) will be blocked in in __ocfs2_cluster_lock() because of
OCFS2_LOCK_BLOCKED.  But, the downconvert never complete, why? because
there is no chance for the first cluster lock on this node to be
unlocked - we block ourselves in the code path.

The idea to fix this issue is mostly taken from gfs2 code.

1. introduce a new field: struct ocfs2_lock_res.l_holders, to keep track
   of the processes' pid who has taken the cluster lock of this lock
   resource;

2. introduce a new flag for ocfs2_inode_lock_full:
   OCFS2_META_LOCK_GETBH; it means just getting back disk inode bh for
   us if we've got cluster lock.

3. export a helper: ocfs2_is_locked_by_me() is used to check if we have
   got the cluster lock in the upper code path.

The tracking logic should be used by some of the ocfs2 vfs's callbacks,
to solve the recursive locking issue cuased by the fact that vfs
routines can call into each other.

The performance penalty of processing the holder list should only be
seen at a few cases where the tracking logic is used, such as get/set
acl.

You may ask what if the first time we got a PR lock, and the second time
we want a EX lock? fortunately, this case never happens in the real
world, as far as I can see, including permission check,
(get|set)_(acl|attr), and the gfs2 code also do so.

[sfr@canb.auug.org.au remove some inlines]
Link: http://lkml.kernel.org/r/20170117100948.11657-2-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherrypicked from commit 439a36b8ef38657f765b80b775e2885338d72451)
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>

Revert "add OCFS2_LOCK_RECURSIVE arg_flags to ocfs2_cluster_lock() to prevent hang"

Orabug: 26427132

This reverts commit 387775a70d2e5d89d1a81d78d66655337a5c2765.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>

xen/blkfront: always allocate grants first from per-queue persistent grants

This patch partially reverts 3df0e50 ("xen/blkfront: pseudo support for
multi hardware queues/rings"). The xen-blkfront queue/ring might hang due
to grants allocation failure in the situation when gnttab_free_head is
almost empty while many persistent grants are reserved for this queue/ring.

As persistent grants management was per-queue since 73716df ("xen/blkfront:
make persistent grants pool per-queue"), we should always allocate from
persistent grants first.

Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26351401

upstream commit: bd912ef3e46b6edb51bb8af4b73fd2be7817e305
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com>

rds: Make sure updates to cp_send_gen can be observed

cp->cp_send_gen is treated as a normal variable, although it may be
used by different threads.

This is fixed by using {READ,WRITE}_ONCE when it is incremented and
READ_ONCE when it is read outside the {acquire,release}_in_xmit
protection.

Normative reference from the Linux-Kernel Memory Model:

    Loads from and stores to shared (but non-atomic) variables should
    be protected with the READ_ONCE(), WRITE_ONCE(), and
    ACCESS_ONCE().

Clause 5.1.2.4/25 in the C standard is also relevant.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from upstream e623a48ee433985f6ca0fb238f0002cc2eccdf53)

Orabug: 26519030

Reviewed-by: Alan Maguire <alan.maguire@oracle.com>

NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migration

Transparent State Migration copies a client's lease state from the
server where a filesystem used to reside to the server where it now
resides. When an NFSv4.1 client first contacts that destination
server, it uses EXCHANGE_ID to detect trunking relationships.

The lease that was copied there is returned to that client, but the
destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
the client. This is because the lease was confirmed on the source
server (before it was copied).

Normally, when CONFIRMED_R is set, a client purges the lease and
creates a new one. However, that throws away the entire benefit of
Transparent State Migration.

Therefore, the client must not purge that lease when it is possible
that Transparent State Migration has occurred.

Reported-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
(cherry picked from commit 8dcbec6d20eb881ba368d0aebc3a8a678aebb1da)

Orabug: 25727872
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>

xen: do not re-use pirq number cached in pci device msi msg data

Revert the main part of commit:
af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")

That commit introduced reading the pci device's msi message data to see
if a pirq was previously configured for the device's msi/msix, and re-use
that pirq.  At the time, that was the correct behavior.  However, a
later change to Qemu caused it to call into the Xen hypervisor to unmap
all pirqs for a pci device, when the pci device disables its MSI/MSIX
vectors; specifically the Qemu commit:
c976437c7dba9c7444fb41df45468968aaa326ad
("qemu-xen: free all the pirqs for msi/msix when driver unload")

Once Qemu added this pirq unmapping, it was no longer correct for the
kernel to re-use the pirq number cached in the pci device msi message
data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
pirqs when the pci device disables its MSI/MSIX vectors.

This bug is causing failures to initialize multiple NVMe controllers
under Xen, because the NVMe driver sets up a single MSIX vector for
each controller (concurrently), and then after using that to talk to
the controller for some configuration data, it disables the single MSIX
vector and re-configures all the MSIX vectors it needs.  So the MSIX
setup code tries to re-use the cached pirq from the first vector
for each controller, but the hypervisor has already given away that
pirq to another controller, and its initialization fails.

This is discussed in more detail at:
https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html

Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit c74fd80f2f41d05f350bb478151021f88551afe8)

Orabug: 26547167

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Re-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

MacSec: fix backporting error in patches for CVE-2017-7477

Orabug: 26443893

- macsec: dynamically allocate space for sglist (Jason A. Donenfeld)
[Orabug: 26368162]  {CVE-2017-7477}
- macsec: avoid heap overflow in skb_to_sgvec (Jason A. Donenfeld)  [Orabug:
26368162]  {CVE-2017-7477}

The backporting of above patches introduded a heap overrun error shown
as bug 26443893.

------------[ cut here ]------------
WARNING: CPU: 28 PID: 0 at kernel/time/timer.c:1177
call_timer_fn+0x142/0x150()
timer: mld_ifc_timer_expire+0x0/0x2d0 preempt leak: 00000100 -> 00000101
Modules linked in: gcm macsec fuse btrfs xor raid6_pq vfat msdos fat ext4
jbd2 ext2 mbcache2 ip6table_filter ip6_tables
BUG: workqueue leaked lock or atomic: kworker/15:2/0x00000001/689
     last function: addrconf_dad_work
CPU: 15 PID: 689 Comm: kworker/15:2 Not tainted 4.1.12-103.2.6.el7uek.x86_64
Hardware name: Oracle Corporation SUN SERVER X4-2       /ASSY,MOTHERBOARD,1U
, BIOS 25010603 01/16/2014
Workqueue: ipv6_addrconf addrconf_dad_work

Call Trace:
[<ffffffff81735938>] dump_stack+0x63/0x81
[<ffffffff810a0fd8>] process_one_work+0x3a8/0x460
[<ffffffff810a1582>] worker_thread+0x112/0x520
[<ffffffff810a1470>] ? rescuer_thread+0x3e0/0x3e0
[<ffffffff810a7348>] kthread+0xd8/0xf0
[<ffffffff810a7270>] ? kthread_create_on_node+0x1b0/0x1b0
[<ffffffff8173d9a2>] ret_from_fork+0x42/0x70
[<ffffffff810a7270>] ? kthread_create_on_node+0x1b0/0x1b0
BUG: scheduling while atomic: kworker/15:2/689/0x00000001

1. newly introduced variable "num_frags" not used in 'sg_ad', assumes
'MAX_SKB_FRAGS + 1'

2. Initialization of sglist assumes 'MAX_SKB_FRAGS + 1' length, though it was
changed to the number of scatterlist elements being returned from
"skb_cow_data()"

3. It seems that "sg_init_table(sg, MAX_SKB_FRAGS + 1);" is redundant, it was
already done a few lines before.

This patch may solve the above issues.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>

ovl: move super block magic number to magic.h

Orabug: 26546379, 26540706
CVE-2016-1575
CVE-2016-1576

The overlayfs file system is not recognized by programs
like tail because the magic number is not in standard header location.

Move it so that the value will propagate on for the GNU library
and utilities. Needs to go in the fstatfs manual page as well.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
(cherry picked from commit 257f871993474e2bde6c497b54022c362cf398e1)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ovl: use a minimal buffer in ovl_copy_xattr

Orabug: 26546379, 26540706
CVE-2016-1575
CVE-2016-1576

Rather than always allocating the high-order XATTR_SIZE_MAX buffer
which is costly and prone to failure, only allocate what is needed and
realloc if necessary.

Fixes https://github.com/coreos/bugs/issues/489

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
(cherry picked from commit e4ad29fa0d224d05e08b2858e65f112fd8edd4fe)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ovl: allow zero size xattr

Orabug: 26546379, 26540706
CVE-2016-1575
CVE-2016-1576

When ovl_copy_xattr() encountered a zero size xattr no more xattrs were
copied and the function returned success. This is clearly not the desired
behavior.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>
(cherry picked from commit 97daf8b97ad6f913a34c82515be64dc9ac08d63e)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ovl: default permissions

Orabug: 26546379, 26540706
CVE-2016-1575
CVE-2016-1576

Add mount option "default_permissions" to alter the way permissions are
calculated.

Without this option and prior to this patch permissions were calculated by
underlying lower or upper filesystem.

With this option the permissions are calculated by overlayfs based on the
file owner, group and mode bits.

This has significance for example when a read-only exported NFS filesystem
is used as a lower layer. In this case the underlying NFS filesystem will
reply with EROFS, in which case all we know is that the filesystem is
read-only. But that's not what we are interested in, we are interested in
whether the access would be allowed if the filesystem wasn't read-only; the
server doesn't tell us that, and would need updating at various levels,
which doesn't seem practicable.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
(cherry picked from commit 8d3095f4ad47ac409440a0ba1c80e13519ff867d)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

uek-rpm: Add missing .ko files to ueknano modules list

Orabug: 26521422

When installing kernel-ueknano rpm package, warnings are displayed due to missing
symbols. This commit adds modules with needed symbols to /lib/modules/ directory.

Fixes: 74d5ebd39bfa ("uek-rpm: Share specfile for both kernel-ueknano and kernel-uek")
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>

ping: implement proper locking

We got a report of yet another bug in ping

http://www.openwall.com/lists/oss-security/2017/03/24/6

->disconnect() is not called with socket lock held.

Fix this by acquiring ping rwlock earlier.

Thanks to Daniel, Alexander and Andrey for letting us know this problem.

Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Daniel Jiang <danieljiang0415@gmail.com>
Reported-by: Solar Designer <solar@openwall.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 43a6684519ab0a6c52024b5e25322476cabad893)

Orabug: 25883225
CEV: CVE-2017-2671

Signed-off-by: Tim Tianyang Chen <tianyang.chen@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>

xen-blkback: stop blkback thread of every queue in xen_blkif_disconnect

If there is inflight I/O in any non-last queue, blkback returns -EBUSY
directly, and never stops thread of remaining queue and processs them. When
removing vbd device with lots of disk I/O load, some queues with inflight
I/O still have blkback thread running even though the corresponding vbd
device or guest is gone.
And this could cause some problems, for example, if the backend device type
is file, some loop devices and blkback thread always lingers there forever
after guest is destroyed, and this causes failure of umounting repositories
unless rebooting the dom0. So stop all threads properly and return -EBUSY
if any queue has inflight I/O.

OraBug: 26539922

Signed-off-by: Annie Li <annie.li@oracle.com>
Reviewed-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>

uek-rpm: Share specfile for both kernel-ueknano and kernel-uek

Orabug: 26521422

The kernel-ueknano is added as sub package in kernel-uek spec file to
create both the binary rpms from the same source. The list of modules to
be added to kernel-ueknano is passed as input to the spec file.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed By: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-By: Todd Vierling <todd.vierling@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>

PCI: Workaround wrong flags completions for IDT switch

The IDT switch incorrectly flags an ACS source violation on a read config
request to an end point device on the completion (IDT 89H32H8G3-YC,
errata #36) even though the PCI Express spec states that completions are
never affected by ACS source violation (PCI Spec 3.1, Section 6.12.1.1).

The suggested workaround by IDT is to issue a configuration write to the
downstream device before issuing the first config read. This allows the
downstream device to capture its bus number, thus avoiding the ACS
violation on the completion.

The patch does the following -

1. Disable ACS source violation if enabled
2. Wait for config space access to become available by reading vendor id
3. Do a config write to the end point (errata workaround)
4. Enable ACS source validation (if it was enabled to begin with)

-v2: move workaround to pci_bus_read_dev_vendor_id() from
pci_bus_check_dev()
      and move enable_acs_sv to drivers/pci/pci.c -- by Yinghai
-v3: add bus->self check for root bus and virtual bus for sriov vfs.
-v4: only do workaround for IDT switches
-v5: tweak pci_std_enable_acs_sv to deal with unimplemented SV and
clarify return value

Signed-off-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
--

  drivers/pci/pci.c   | 37 +++++++++++++++++++++++++++++++++++++
  drivers/pci/pci.h   |  1 +
  drivers/pci/probe.c | 38 ++++++++++++++++++++++++++++++++++++--
  3 files changed, 74 insertions(+), 2 deletions(-)

Orabug: 26243152

Link: https://patchwork.kernel.org/patch/9828571/
Fixed wrapped lines in the original patch to fit the lines in 80 columns.
Changed variable types from integer to bool to keep consistent with function
return type.

Signed-off-by: Shan Hai <shan.hai@oracle.com>

Revert "SUNRPC: Refactor svc_set_num_threads()"

This reverts commit 0bc9402329824ae06d8a26a73b60d21cdac3e6f2.

Orabug: 26479081
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

Revert "NFSv4: Fix callback server shutdown"

This reverts commit 13757272bec28f1b8be59f56292e8f17076923b3.

Orabug: 26479081
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

nmi_backtrace: generate one-line reports for idle cpus

When doing an nmi backtrace of many cores, most of which are idle, the
output is a little overwhelming and very uninformative. Suppress
messages for cpus that are idling when they are interrupted and just
emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

We do this by grouping all the cpuidle code together into a new
.cpuidle.text section, and then checking the address of the interrupted
PC to see if it lies within that section.

This commit suitably tags x86 and tile idle routines, and only adds in
the minimal framework for other architectures.

Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Thompson <daniel.thompson@linaro.org> [arm]
Tested-by: Petr Mladek <pmladek@suse.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6727ad9e206cc08b80d8000a4d67f8417e53539d)

Orabug: 25925689

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
arch/arm/kernel/vmlinux-xip.lds.S
arch/h8300/kernel/vmlinux.lds.S
arch/x86/kernel/process.c
drivers/acpi/processor_idle.c
kernel/sched/idle.c
lib/nmi_backtrace.c

netfilter: nf_tables: fix oob access

BUG: KASAN: slab-out-of-bounds in nf_tables_rule_destroy+0xf1/0x130 at addr ffff88006a4c35c8
Read of size 8 by task nft/1607

When we've destroyed last valid expr, nft_expr_next() returns an invalid expr.
We must not dereference it unless it passes != nft_expr_last() check.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 3e38df136e453aa69eb4472108ebce2fb00b1ba6)

Orabug: 25960439,26492640,26492632

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>

scsi: libiscsi: use kvzalloc for iscsi_pool_init

iscsiadm session login can fail with the following error:

iscsiadm: Could not login to [iface: default, target: iqn.1986-03.com...
iscsiadm: initiator reported error (9 - internal error)

When /etc/iscsi/iscsid.conf sets node.session.cmds_max = 4096, it
results in 64K-sized kmallocs per session. A system under fragmented
slab pressure may not have any 64K objects available and fail iscsiadm
session login. Even though memory objects of a smaller size are
available, the large order allocation ends up failing.

The kernel prints a warning and does dump_stack, like below:

iscsid: page allocation failure: order:4, mode:0xc0d0
CPU: 0 PID: 2456 Comm: iscsid Not tainted 4.1.12-61.1.28.el6uek.x86_64 #2
Call Trace:
[<ffffffff816c6e40>] dump_stack+0x63/0x83
[<ffffffff8118e58a>] warn_alloc_failed+0xea/0x140
[<ffffffff81191df9>] __alloc_pages_slowpath+0x409/0x760
[<ffffffff81192401>] __alloc_pages_nodemask+0x2b1/0x2d0
[<ffffffffa048f6c0>] ? dev_attr_host_ipaddress+0x20/0xffffffffffffc722
[<ffffffff811dc38f>] alloc_pages_current+0xaf/0x170
[<ffffffff81192581>] alloc_kmem_pages+0x31/0xd0
[<ffffffffa048f600>] ? iscsi_transport_group+0x20/0xffffffffffffc7e2
[<ffffffff811ad738>] kmalloc_order+0x18/0x50
[<ffffffff811ad7a4>] kmalloc_order_trace+0x34/0xe0
[<ffffffff8146ee30>] ? transport_remove_classdev+0x70/0x70
[<ffffffff811e843d>] __kmalloc+0x27d/0x2a0
[<ffffffff810c8cbd>] ? complete_all+0x4d/0x60
[<ffffffffa04af299>] iscsi_pool_init+0x69/0x160 [libiscsi]
[<ffffffff81465d90>] ? device_initialize+0xb0/0xd0
[<ffffffffa04af510>] iscsi_session_setup+0x180/0x2f4 [libiscsi]
[<ffffffffa04c5a60>] ? iscsi_max_lun+0x20/0xfffffffffffffa9e [iscsi_tcp]
[<ffffffffa04c531f>] iscsi_sw_tcp_session_create+0xcf/0x150 [iscsi_tcp]
[<ffffffffa04c5a60>] ? iscsi_max_lun+0x20/0xfffffffffffffa9e [iscsi_tcp]
[<ffffffffa048a633>] iscsi_if_create_session+0x33/0xd0
[<ffffffffa04c5a60>] ? iscsi_max_lun+0x20/0xfffffffffffffa9e [iscsi_tcp]
[<ffffffffa048abd8>] iscsi_if_recv_msg+0x508/0x8c0 [scsi_transport_iscsi]
[<ffffffff811922eb>] ? __alloc_pages_nodemask+0x19b/0x2d0
[<ffffffff811e6d69>] ? __kmalloc_node_track_caller+0x209/0x2c0
[<ffffffffa048b00c>] iscsi_if_rx+0x7c/0x200 [scsi_transport_iscsi]
[<ffffffff81623dc6>] netlink_unicast+0x126/0x1c0
[<ffffffff8162468c>] netlink_sendmsg+0x36c/0x400
[<ffffffff815d2fed>] sock_sendmsg+0x4d/0x60
[<ffffffff815d596a>] ___sys_sendmsg+0x30a/0x330
[<ffffffff811bc72c>] ? handle_pte_fault+0x20c/0x230
[<ffffffff811bc90c>] ? __handle_mm_fault+0x1bc/0x330
[<ffffffff811bcb32>] ? handle_mm_fault+0xb2/0x1a0
[<ffffffff815d5b99>] __sys_sendmsg+0x49/0x90
[<ffffffff815d5bf9>] SyS_sendmsg+0x19/0x20
[<ffffffff816cbb2e>] system_call_fastpath+0x12/0x71

Use kvzalloc for iscsi_pool in iscsi_pool_init.

Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com>
Tested-by: Kyle Fortin <kyle.fortin@oracle.com>
Reviewed-by: Joseph Slember <joe.slember@oracle.com>
Reviewed-by: Lance Hartmann <lance.hartmann@oracle.com>
Acked-by: Lee Duncan <lduncan@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 26473178
(cherry picked from commit bfcc62ed7066268349e8e7955925bdaf4be0eec0)
Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

mm: introduce kv[mz]alloc helpers

Patch series "kvmalloc", v5.

There are many open coded kmalloc with vmalloc fallback instances in the
tree.  Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code.  Yet we do not have any common helper
for that and so users have invented their own helpers.  Some of them are
really creative when doing so.  Let's just add kv[mz]alloc and make sure
it is implemented properly.  This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures.  This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them.  In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL).  Those need to be
fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers.  Existing ones would have to die
slowly.

[sfr@canb.auug.org.au: f2fs fixup]
Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 26473178
(cherry picked from commit a7c3e901a46ff54c016d040847eda598a9e3e653)
Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Conflicts:

arch/x86/kvm/lapic.c
arch/x86/kvm/page_track.c
arch/x86/kvm/x86.c
fs/f2fs/f2fs.h
fs/f2fs/file.c
fs/f2fs/node.c
fs/f2fs/segment.c
fs/seq_file.c
security/apparmor/apparmorfs.c
security/apparmor/include/lib.h
security/apparmor/lib.c
security/apparmor/policy_unpack.c
virt/kvm/kvm_main.c

sg: Fix double-free when drives detach during SG_IO

In sg_common_write(), we free the block request and return -ENODEV if
the device is detached in the middle of the SG_IO ioctl().

Unfortunately, sg_finish_rem_req() also tries to free srp->rq, so we
end up freeing rq->cmd in the already free rq object, and then free
the object itself out from under the current user.

This ends up corrupting random memory via the list_head on the rq
object. The most common crash trace I saw is this:

  ------------[ cut here ]------------
  kernel BUG at block/blk-core.c:1420!
  Call Trace:
  [<ffffffff81281eab>] blk_put_request+0x5b/0x80
  [<ffffffffa0069e5b>] sg_finish_rem_req+0x6b/0x120 [sg]
  [<ffffffffa006bcb9>] sg_common_write.isra.14+0x459/0x5a0 [sg]
  [<ffffffff8125b328>] ? selinux_file_alloc_security+0x48/0x70
  [<ffffffffa006bf95>] sg_new_write.isra.17+0x195/0x2d0 [sg]
  [<ffffffffa006cef4>] sg_ioctl+0x644/0xdb0 [sg]
  [<ffffffff81170f80>] do_vfs_ioctl+0x90/0x520
  [<ffffffff81258967>] ? file_has_perm+0x97/0xb0
  [<ffffffff811714a1>] SyS_ioctl+0x91/0xb0
  [<ffffffff81602afb>] tracesys+0xdd/0xe2
    RIP [<ffffffff81281e04>] __blk_put_request+0x154/0x1a0

The solution is straightforward: just set srp->rq to NULL in the
failure branch so that sg_finish_rem_req() doesn't attempt to re-free
it.

Additionally, since sg_rq_end_io() will never be called on the object
when this happens, we need to free memory backing ->cmd if it isn't
embedded in the object itself.

KASAN was extremely helpful in finding the root cause of this bug.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
Acked-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 26492266
(cherry picked from commit f3951a3709ff50990bf3e188c27d346792103432)
Signed-off-by: John Sobecki <john.sobecki@oracle.com>

scsi: smartpqi: mark PM functions as __maybe_unused

Orabug: 26191021, 26447813

The newly added suspend/resume support causes harmless warnings when
CONFIG_PM is disabled:

smartpqi/smartpqi_init.c:5147:12: error: 'pqi_ctrl_wait_for_pending_io' defined but not used [-Werror=unused-function]
smartpqi/smartpqi_init.c:2019:13: error: 'pqi_wait_until_lun_reset_finished' defined but not used [-Werror=unused-function]
smartpqi/smartpqi_init.c:2013:13: error: 'pqi_wait_until_scan_finished' defined but not used [-Werror=unused-function]

We can avoid the warnings by removing the #ifdef around the handlers and
instead marking them as __maybe_unused, which will let gcc drop the
unused code silently.

Fixes: f44d210312a6 ("scsi: smartpqi: add suspend and resume support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 5c146686e32085e76ad9e2957f3dee9b28fe4f22)
Signed-off-by: Brian Maly <brian.maly@oracle.com>