www.infradead.org Git - users/jedix/linux-maple.git/log

kvm: x86: reduce collisions in mmu_page_hash

When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs), becomes imbalanced; with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.

The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadow. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.

hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):

hash            max_mmu_page_hash_collisions
--------------------------------------------
low 10 bits     49847
hash_64         105
perfect         97

While we're changing the hash, increase the table size by 4x to better
support large VMs (further reduces number of collisions in 200G VM to
29).

Note that hash_64() does not provide a good distribution prior to commit
ef703f49a6c5 ("Eliminate bad hash multipliers from hash_32() and
hash_64()").

Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Orabug: 26628797
(cherry picked from commit 114df303a7eeae8b50ebf68229b7e647714a9bea)
Signed-off-by: Govinda Tatti <Govinda.Tatti@Oracle.COM>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Tested-by: Eyal Moscovici <eyal.moscovici@oracle.com> [Ravello/BMCS Team]

IB/ipoib: For sendonly join free the multicast group on leave

Orabug: 26324050

When we leave the multicast group on expiration of a neighbor we
do not free the mcast structure. This results in a memory leak
that causes ib_dealloc_pd to fail and print a WARN_ON message
and backtrace.

Fixes: bd99b2e05c4d (IB/ipoib: Expire sendonly multicast joins)
Signed-off-by: Christoph Lameter <cl@linux.com>
Tested-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 0b5c9279e568d90903acedc2b9b832d8d78e8288)

Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

IB/ipoib: increase the max mcast backlog queue

Orabug: 26324050

When performing sendonly joins, we queue the packets that trigger
the join until the join completes.  This may take on the order of
hundreds of milliseconds.  It is easy to have many more than three
packets come in during that time.  Expand the maximum queue depth
in order to try and prevent dropped packets during the time it
takes to join the multicast group.

Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 2866196f294954ce9fa226825c8c1eaa64c7da8a)

Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

IB/ipoib: Make sendonly multicast joins create the mcast group

Orabug: 26324050

Since IPoIB should, as much as possible, emulate how multicast
sends work on Ethernet for regular TCP/IP apps, there should be
no requirement to subscribe to a multicast group before your
sends are properly sent.  However, due to the difference in how
multicast is handled on InfiniBand, we must join the appropriate
multicast group before we can send to it.  Previously we tried
not to trigger the auto-create feature of the subnet manager when
doing this because we didn't have tracking of these sendonly
groups and the auto-creation might never get undone.  The previous
patch added timing to these sendonly joins and allows us to
leave them after a reasonable idle expiration time.  So supply
all of the information needed to auto-create group.

Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit c3852ab0e606212de523c1fb1e15adbf9f431619)

Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

IB/ipoib: Expire sendonly multicast joins

Orabug: 26324050

On neighbor expiration, check to see if the neighbor was actually a
sendonly multicast join, and if so, leave the multicast group as we
expire the neighbor.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit bd99b2e05c4df2a428e5c9dd338289089d0e26df)

Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

IB/ipoib: Suppress warning for send only join failures

Orabug: 26324050

We expect send only joins to fail, it just means there are no listeners
for the group. The correct thing to do is silently drop the packet
at source.

Eg avahi will full join 224.0.0.251 which causes a send only IGMP packet
to 224.0.0.22, and then a warning level kmessage like this:

ib0: sendonly multicast join failed for ff12:401b:ffff:0000:0000:0000:0000:0016, status -22

If there is no IP router listening to IGMP.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit d1178cbcdcf91900ccf10a177350d7945703c151)

Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

IB/ipoib: Clean up send-only multicast joins

Orabug: 26324050

Even though we don't expect the group to be created by the SM we
sill need to provide all the parameters to force the SM to validate
they are correct.

Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit c3acdc06a95ff20d920220ecb931186b0bb22c42)

Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>

fs/exec.c: account for argv/envp pointers

Orabug: 26365008
CVE: CVE-2017-1000365

When limiting the argv/envp strings during exec to 1/4 of the stack limit,
the storage of the pointers to the strings was not included. This means
that an exec with huge numbers of tiny strings could eat 1/4 of the stack
limit in strings and then additional space would be later used by the
pointers to the strings.

For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721
single-byte strings would consume less than 2MB of stack, the max (8MB /
4) amount allowed, but the pointers to the strings would consume the
remaining additional stack space (1677721 * 4 == 6710884).

The result (1677721 + 6710884 == 8388605) would exhaust stack space
entirely. Controlling this stack exhaustion could result in
pathological behavior in setuid binaries (CVE-2017-1000365).

[akpm@linux-foundation.org: additional commenting from Kees]
Fixes: b6a2fea39318 ("mm: variable length argument support")
Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 98da7d08850fb8bdeb395d6368ed15753304aa0c)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

sched/core: Use load_avg for selecting idlest group

find_idlest_group() only compares the runnable_load_avg when looking
for the least loaded group. But on fork intensive use case like
hackbench where tasks blocked quickly after the fork, this can lead to
selecting the same CPU instead of other CPUs, which have similar
runnable load but a lower load_avg.

When the runnable_load_avg of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There is
now 3 zones for the runnable_load of the rq:

- [0 .. (runnable_load - imbalance)]:
Select the new rq which has significantly less runnable_load

- [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
The runnable loads are close so we use load_avg to chose
between the 2 rq

- [(runnable_load + imbalance) .. ULONG_MAX]:
Keep the current rq which has significantly less runnable_load

The scale factor that is currently used for comparing runnable_load,
doesn't work well with small value. As an example, the use of a
scaling factor fails as soon as this_runnable_load == 0 because we
always select local rq even if min_runnable_load is only 1, which
doesn't really make sense because they are just the same. So instead
of scaling factor, we use an absolute margin for runnable_load to
detect CPUs with similar runnable_load and we keep using scaling
factor for blocked load.

For use case like hackbench, this enable the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.

Tests have been done on a Hikey board (ARM based octo cores) for
several kernel. The result below gives min, max, avg and stdev values
of 18 runs with each configuration.

The patches depend on the "no missing update_rq_clock()" work.

hackbench -P -g 1

         ea86cb4b7621  7dc603c9028e  v4.8        v4.8+patches
  min    0.049         0.050         0.051       0,048
  avg    0.057         0.057(0%)     0.057(0%)   0,055(+5%)
  max    0.066         0.068         0.070       0,063
  stdev  +/-9%         +/-9%         +/-8%       +/-9%

More performance numbers here:

  https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk

Orabug: 25862897

Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten.Rasmussen@arm.com
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: kernellwp@gmail.com
Cc: umgwanakikbuti@gmail.com
Cc: yuyang.du@intel.comc
Link: http://lkml.kernel.org/r/1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6b94780e45c17b83e3e75f8aaca5a328db583c74)
Conflicts:
kernel/sched/fair.c
Signed-off-by: subhra mazumdar <subhra.mazumdar@oracle.com>
Reviewed-by: Atish Patra <atish.patra@oracle.com>

dentry name snapshots

Orabug: 26630800
CVE: CVE-2017-7533

take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified). In either case the pointer to stable
string is stored into the same structure.

dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().

Intended use:
struct name_snapshot s;

take_dentry_name_snapshot(&s, dentry);
...
access s.name
...
release_dentry_name_snapshot(&s);

Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 49d31c2f389acfe83417083e1208422b4091cd9e)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration

Transparent State Migration copies a client's lease state from the
server where a filesystem used to reside to the server where it now
resides. When an NFSv4.1 client first contacts that destination
server, it uses EXCHANGE_ID to detect trunking relationships.

The lease that was copied there is returned to that client, but the
destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
the client. This is because the lease was confirmed on the source
server (before it was copied).

Normally, when CONFIRMED_R is set, a client purges the lease and
creates a new one. However, that throws away the entire benefit of
Transparent State Migration.

Therefore, the client must use the contrived slot sequence value
returned by the destination server for its first CREATE_SESSION
operation after a Transparent State Migration.

Orabug: 25802443
Reported-by: Xuan Qi <xuan.qi@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
(hand picked mainline 838edb9 NFSv4.1: Use seqid returned ...)
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>

ipv6: fix out of bound writes in __ip6_append_data()

Orabug: 26575181
CVE: CVE-2017-9242

Andrey Konovalov and idaifish@gmail.com reported crashes caused by
one skb shared_info being overwritten from __ip6_append_data()

Andrey program lead to following state :

copy -4200 datalen 2000 fraglen 2040
maxfraglen 2040 alloclen 2048 transhdrlen 0 offset 0 fraggap 6200

The skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen,
fraggap, 0); is overwriting skb->head and skb_shared_info

Since we apparently detect this rare condition too late, move the
code earlier to even avoid allocating skb and risking crashes.

Once again, many thanks to Andrey and syzkaller team.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: <idaifish@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 232cd35d0804cc241eb887bb8d4d9b3b9881c64a)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/ipv6/ip6_output.c

mnt: Add a per mount namespace limit on the number of mounts

Orabug: 26575596
CVE: CVE-2016-6213

CAI Qian <caiqian@redhat.com> pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent <raven@themaw.net> described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.

Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit d29216842a85c7970c536108e093963f02714498)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/namespace.c
kernel/sysctl.c

l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()

Lock socket before checking the SOCK_ZAPPED flag in l2tp_ip6_bind().
Without lock, a concurrent call could modify the socket flags between
the sock_flag(sk, SOCK_ZAPPED) test and the lock_sock() call. This way,
a socket could be inserted twice in l2tp_ip6_bind_table. Releasing it
would then leave a stale pointer there, generating use-after-free
errors when walking through the list or modifying adjacent entries.

BUG: KASAN: use-after-free in l2tp_ip6_close+0x22e/0x290 at addr ffff8800081b0ed8
Write of size 8 by task syz-executor/10987
CPU: 0 PID: 10987 Comm: syz-executor Not tainted 4.8.0+ #39
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
ffff880031d97838 ffffffff829f835b ffff88001b5a1640 ffff8800081b0ec0
ffff8800081b15a0 ffff8800081b6d20 ffff880031d97860 ffffffff8174d3cc
ffff880031d978f0 ffff8800081b0e80 ffff88001b5a1640 ffff880031d978e0
Call Trace:
[<ffffffff829f835b>] dump_stack+0xb3/0x118 lib/dump_stack.c:15
[<ffffffff8174d3cc>] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
[<     inline     >] print_address_description mm/kasan/report.c:194
[<ffffffff8174d666>] kasan_report_error+0x1f6/0x4d0 mm/kasan/report.c:283
[<     inline     >] kasan_report mm/kasan/report.c:303
[<ffffffff8174db7e>] __asan_report_store8_noabort+0x3e/0x40 mm/kasan/report.c:329
[<     inline     >] __write_once_size ./include/linux/compiler.h:249
[<     inline     >] __hlist_del ./include/linux/list.h:622
[<     inline     >] hlist_del_init ./include/linux/list.h:637
[<ffffffff8579047e>] l2tp_ip6_close+0x22e/0x290 net/l2tp/l2tp_ip6.c:239
[<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
[<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
[<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
[<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
[<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
[<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
[<ffffffff813774f9>] task_work_run+0xf9/0x170
[<ffffffff81324aae>] do_exit+0x85e/0x2a00
[<ffffffff81326dc8>] do_group_exit+0x108/0x330
[<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
[<ffffffff811b49af>] do_signal+0x7f/0x18f0
[<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
[<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
[<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Object at ffff8800081b0ec0, in cache L2TP/IPv6 size: 1448
Allocated:
PID = 10987
[ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
[ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
[ 1116.897025] [<ffffffff8174c9ad>] kasan_kmalloc+0xad/0xe0
[ 1116.897025] [<ffffffff8174cee2>] kasan_slab_alloc+0x12/0x20
[ 1116.897025] [<     inline     >] slab_post_alloc_hook mm/slab.h:417
[ 1116.897025] [<     inline     >] slab_alloc_node mm/slub.c:2708
[ 1116.897025] [<     inline     >] slab_alloc mm/slub.c:2716
[ 1116.897025] [<ffffffff817476a8>] kmem_cache_alloc+0xc8/0x2b0 mm/slub.c:2721
[ 1116.897025] [<ffffffff84c4f6a9>] sk_prot_alloc+0x69/0x2b0 net/core/sock.c:1326
[ 1116.897025] [<ffffffff84c58ac8>] sk_alloc+0x38/0xae0 net/core/sock.c:1388
[ 1116.897025] [<ffffffff851ddf67>] inet6_create+0x2d7/0x1000 net/ipv6/af_inet6.c:182
[ 1116.897025] [<ffffffff84c4af7b>] __sock_create+0x37b/0x640 net/socket.c:1153
[ 1116.897025] [<     inline     >] sock_create net/socket.c:1193
[ 1116.897025] [<     inline     >] SYSC_socket net/socket.c:1223
[ 1116.897025] [<ffffffff84c4b46f>] SyS_socket+0xef/0x1b0 net/socket.c:1203
[ 1116.897025] [<ffffffff85e4d685>] entry_SYSCALL_64_fastpath+0x23/0xc6
Freed:
PID = 10987
[ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
[ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
[ 1116.897025] [<ffffffff8174cf61>] kasan_slab_free+0x71/0xb0
[ 1116.897025] [<     inline     >] slab_free_hook mm/slub.c:1352
[ 1116.897025] [<     inline     >] slab_free_freelist_hook mm/slub.c:1374
[ 1116.897025] [<     inline     >] slab_free mm/slub.c:2951
[ 1116.897025] [<ffffffff81748b28>] kmem_cache_free+0xc8/0x330 mm/slub.c:2973
[ 1116.897025] [<     inline     >] sk_prot_free net/core/sock.c:1369
[ 1116.897025] [<ffffffff84c541eb>] __sk_destruct+0x32b/0x4f0 net/core/sock.c:1444
[ 1116.897025] [<ffffffff84c5aca4>] sk_destruct+0x44/0x80 net/core/sock.c:1452
[ 1116.897025] [<ffffffff84c5ad33>] __sk_free+0x53/0x220 net/core/sock.c:1460
[ 1116.897025] [<ffffffff84c5af23>] sk_free+0x23/0x30 net/core/sock.c:1471
[ 1116.897025] [<ffffffff84c5cb6c>] sk_common_release+0x28c/0x3e0 ./include/net/sock.h:1589
[ 1116.897025] [<ffffffff8579044e>] l2tp_ip6_close+0x1fe/0x290 net/l2tp/l2tp_ip6.c:243
[ 1116.897025] [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
[ 1116.897025] [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
[ 1116.897025] [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
[ 1116.897025] [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
[ 1116.897025] [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
[ 1116.897025] [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
[ 1116.897025] [<ffffffff813774f9>] task_work_run+0xf9/0x170
[ 1116.897025] [<ffffffff81324aae>] do_exit+0x85e/0x2a00
[ 1116.897025] [<ffffffff81326dc8>] do_group_exit+0x108/0x330
[ 1116.897025] [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
[ 1116.897025] [<ffffffff811b49af>] do_signal+0x7f/0x18f0
[ 1116.897025] [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
[ 1116.897025] [<     inline     >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[ 1116.897025] [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
[ 1116.897025] [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Memory state around the buggy address:
ffff8800081b0d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8800081b0e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8800081b0e80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                    ^
ffff8800081b0f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8800081b0f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

==================================================================

The same issue exists with l2tp_ip_bind() and l2tp_ip_bind_table.

Fixes: c51ce49735c1 ("l2tp: fix oops in L2TP IP sockets for connect() AF_UNSPEC case")
Reported-by: Baozeng Ding <sploving1@gmail.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 32c231164b762dddefa13af5a0101032c70b50ef)

Orabug: 26575341
CVE: CVE-2016-10200

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

KEYS: Disallow keyrings beginning with '.' to be joined as session keyrings

This fixes CVE-2016-9604.

Keyrings whose name begin with a '.' are special internal keyrings and so
userspace isn't allowed to create keyrings by this name to prevent
shadowing.  However, the patch that added the guard didn't fix
KEYCTL_JOIN_SESSION_KEYRING.  Not only can that create dot-named keyrings,
it can also subscribe to them as a session keyring if they grant SEARCH
permission to the user.

This, for example, allows a root process to set .builtin_trusted_keys as
its session keyring, at which point it has full access because now the
possessor permissions are added.  This permits root to add extra public
keys, thereby bypassing module verification.

This also affects kexec and IMA.

This can be tested by (as root):

keyctl session .builtin_trusted_keys
keyctl add user a a @s
keyctl list @s

which on my test box gives me:

2 keys in keyring:
180010936: ---lswrv     0     0 asymmetric: Build time autogenerated kernel key: ae3d4a31b82daa8e1a75b49dc2bba949fd992a05
801382539: --alswrv     0     0 user: a

Fix this by rejecting names beginning with a '.' in the keyctl.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
cc: linux-ima-devel@lists.sourceforge.net
cc: stable@vger.kernel.org
(cherry picked from commit ee8f844e3c5a73b999edf733df1c529d6503ec2f)

Orabug: 26575534
CVE: CVE-2016-9604

Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>

sctp: do not inherit ipv6_{mc|ac|fl}_list from parent

SCTP needs fixes similar to 83eaddab4378 ("ipv6/dccp: do not inherit
ipv6_mc_list from parent"), otherwise bad things can happen.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit fdcee2cbb8438702ea1b328fb6e0ac5e9a40c7f8)

Orabug: 26107745
CVE: CVE-2017-9075

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

crypto: algif_hash - Fix result clobbering in recvmsg

Recently an init call was added to hash_recvmsg so as to reset
the hash state in case a sendmsg call was never made.

Unfortunately this ended up clobbering the result if the previous
sendmsg was done with a MSG_MORE flag. This patch fixes it by
excluding that case when we make the init call.

Fixes: a8348bca2944 ("algif_hash - Fix NULL hash crash with shash")
Reported-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 8acf7a106326eb94e143552de81f34308149121c)

Orabug: 25698521

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

crypto: algif_hash - Fix NULL hash crash with shash

Recently algif_hash has been changed to allow null hashes. This
triggers a bug when used with an shash algorithm whereby it will
cause a crash during the digest operation.

This patch fixes it by avoiding the digest operation and instead
doing an init followed by a final which avoids the buggy code in
shash.

This patch also ensures that the result buffer is freed after an
error so that it is not returned as a genuine hash result on the
next recv call.

The shash/ahash wrapper code will be fixed later to handle this
case correctly.

Fixes: 493b2ed3f760 ("crypto: algif_hash - Handle NULL hashes correctly")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Laura Abbott <labbott@redhat.com>
(cherry picked from commit a8348bca2944d397a528772f5c0ccb47a8b58af4)

Orabug: 25698521

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

crypto: algif_hash - Handle NULL hashes correctly

Right now attempting to read an empty hash simply returns zeroed
bytes, this patch corrects this by calling the digest function
using an empty input.

Reported-by: Russell King - ARM Linux <linux@armlinux.org.uk>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 493b2ed3f7603a15ff738553384d5a4510ffeb95)

Orabug: 25698521

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

x86/irq: Protect smp_cleanup_move

smp_cleanup_move fiddles without protection in the interrupt
descriptors and the vector array. A concurrent irq setup/teardown or
affinity setting can pull the rug under that operation.

Add proper locking.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Link: http://lkml.kernel.org/r/20150802203609.222975294@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit df54c4934e030e73cb6a7bd6713f697350dabd0b)

Orabug: 25677661

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Jack vogel <jack.vogel@oracle.com>
Conflicts:
arch/x86/kernel/apic/vector.c

i40e/i40evf: check for stopped admin queue

It's possible that while we are waiting for the spinlock, another
entity (that owns the spinlock) has shut down the admin queue.
If we then attempt to use the queue, we will panic.

Add a check for this condition on the receive side. This matches
an existing check on the send queue side.

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 26654196
(cherry picked from commit 43ae93a93e8c95c5e6389dc8e11704712b1ab2e9)
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

Btrfs: fix extent_same allowing destination offset beyond i_size

When using the same file as the source and destination for a dedup
(extent_same ioctl) operation we were allowing it to dedup to a
destination offset beyond the file's size, which doesn't make sense and
it's not allowed for the case where the source and destination files are
not the same file. This made de deduplication operation successful only
when the source range corresponded to a hole, a prealloc extent or an
extent with all bytes having a value of 0x00. This was also leaving a
file hole (between i_size and destination offset) without the
corresponding file extent items, which can be reproduced with the
following steps for example:

  $ mkfs.btrfs -f /dev/sdi
  $ mount /dev/sdi /mnt/sdi

  $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
  wrote 404990/404990 bytes at offset 304457
  395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)

  $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
  Deduping 2 total files
  (28672, 24576): /mnt/sdi/foobar
  (929792, 24576): /mnt/sdi/foobar
  1 files asked to be deduped
  i: 0, status: 0, bytes_deduped: 24576
  24576 total bytes deduped in this operation

  $ umount /mnt/sdi
  $ btrfsck /dev/sdi
  Checking filesystem on /dev/sdi
  UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
  checking extents
  checking free space cache
  checking fs roots
  root 5 inode 257 errors 100, file extent discount
  Found file extent holes:
          start: 712704, len: 217088
  found 540673 bytes used err is 1
  total csum bytes: 400
  total tree bytes: 131072
  total fs tree bytes: 32768
  total extent tree bytes: 16384
  btree space waste bytes: 123675
  file data blocks allocated: 671744
    referenced 671744
  btrfs-progs v4.2.3

So fix this by not allowing the destination to go beyond the file's size,
just as we do for the same where the source and destination files are not
the same.

A test for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Orabug: 26441487

(cherry picked from commit f4dfe6871006c62abdccc77b2818b11f376e98e2)
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

btrfs: fix clone / extent-same deadlocks

Clone and extent same lock their source and target inodes in opposite order.
In addition to this, the range locking in clone doesn't take ordering into
account. Fix this by having clone use the same locking helpers as
btrfs-extent-same.

In addition, I do a small cleanup of the locking helpers, removing a case
(both inodes being the same) which was poorly accounted for and never
actually used by the callers.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit 293a8489f300536dc6d996c35a6ebb89aa03bab2)

Orabug: 26251039

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: John Haxby <john.haxby@oracle.com>

btrfs: don't update mtime/ctime on deduped inodes

One issue users have reported is that dedupe changes mtime on files,
resulting in tools like rsync thinking that their contents have changed when
in fact the data is exactly the same. We also skip the ctime update as no
user-visible metadata changes here and we want dedupe to be transparent to
the user.

Clone still wants time changes, so we special case this in the code.

This was tested with the btrfs-extent-same tool.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit 1c919a5e13702caffbe2d2c7c305f9d0d2925160)

Orabug: 26251039

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: John Haxby <john.haxby@oracle.com>

btrfs: allow dedupe of same inode

clone() supports cloning within an inode so extent-same can do
the same now. This patch fixes up the locking in extent-same to
know about the single-inode case. In addition to that, we add a
check for overlapping ranges, which clone does not allow.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit 0efa9f48c7e6c15e75946dd2b1c82d3d19e13545)

Orabug: 26251039

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: John Haxby <john.haxby@oracle.com>

btrfs: fix deadlock with extent-same and readpage

->readpage() does page_lock() before extent_lock(), we do the opposite in
extent-same. We want to reverse the order in btrfs_extent_same() but it's
not quite straightforward since the page locks are taken inside btrfs_cmp_data().

So I split btrfs_cmp_data() into 3 parts with a small context structure that
is passed between them. The first, btrfs_cmp_data_prepare() gathers up the
pages needed (taking page lock as required) and puts them on our context
structure. At this point, we are safe to lock the extent range. Afterwards,
we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free()
to clean up our context.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit f441460202cb787c49963bcc1f54cb48c52f7512)

Orabug: 26251039

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: John Haxby <john.haxby@oracle.com>

btrfs: pass unaligned length to btrfs_cmp_data()

In the case that we dedupe the tail of a file, we might expand the dedupe
len out to the end of our last block. We don't want to compare data past
i_size however, so pass the original length to btrfs_cmp_data().

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit 207910ddeeda38fd54544d94f8c8ca5a9632cc25)

Orabug: 26251039

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: John Haxby <john.haxby@oracle.com>

xen: fix bio vec merging

The current test for bio vec merging is not fully accurate and can be
tricked into merging bios when certain grant combinations are used.
The result of these malicious bio merges is a bio that extends past
the memory page used by any of the originating bios.

Take into account the following scenario, where a guest creates two
grant references that point to the same mfn, ie: grant 1 -> mfn A,
grant 2 -> mfn A.

These references are then used in a PV block request, and mapped by
the backend domain, thus obtaining two different pfns that point to
the same mfn, pfn B -> mfn A, pfn C -> mfn A.

If those grants happen to be used in two consecutive sectors of a disk
IO operation becoming two different bios in the backend domain, the
checks in xen_biovec_phys_mergeable will succeed, because bfn1 == bfn2
(they both point to the same mfn). However due to the bio merging,
the backend domain will end up with a bio that expands past mfn A into
mfn A + 1.

Fix this by making sure the check in xen_biovec_phys_mergeable takes
into account the offset and the length of the bio, this basically
replicates whats done in __BIOVEC_PHYS_MERGEABLE using mfns (bus
addresses). While there also remove the usage of
__BIOVEC_PHYS_MERGEABLE, since that's already checked by the callers
of xen_biovec_phys_mergeable.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Orabug: 26564183
CVE: CVE-2017-12134

Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>

ovl: verify upper dentry in ovl_remove_and_whiteout()

Orabug: 26175588

The upper dentry may become stale before we call ovl_lock_rename_workdir.
For example, someone could (mistakenly or maliciously) manually unlink(2)
it directly from upperdir.

To ensure it is not stale, let's lookup it after ovl_lock_rename_workdir
and and check if it matches the upper dentry.

Essentially, it is the same problem and similar solution as in
commit 11f3710417d0 ("ovl: verify upper dentry before unlink and rename").

Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
(cherry picked from commit cfc9fde0b07c3b44b570057c5f93dda59dca1c94)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

ovl: use O_LARGEFILE in ovl_copy_up()

Orabug: 26619890

Open the lower file with O_LARGEFILE in ovl_copy_up().

Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for
catching 32-bit userspace dealing with a file large enough that it'll be
mishandled if the application isn't aware that there might be an integer
overflow. Inside the kernel, there shouldn't be any problems.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # v3.18+
(cherry picked from commit 0480334fa60488d12ae101a02d7d9e1a3d03d7dd)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Ethan Zhao <ethan.zhao@oracle.com>

xfs: toggle readonly state around xfs_log_mount_finish

When we do log recovery on a readonly mount, unlinked inode
processing does not happen due to the readonly checks in
xfs_inactive(), which are trying to prevent any I/O on a
readonly mount.

This is misguided - we do I/O on readonly mounts all the time,
for consistency; for example, log recovery. So do the same
RDONLY flag twiddling around xfs_log_mount_finish() as we
do around xfs_log_mount(), for the same reason.

This all cries out for a big rework but for now this is a
simple fix to an obvious problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Orabug: 26630113

Link: https://patchwork.kernel.org/patch/9857173/
move readonly checking into the if statement to apply to the current kernel

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

xfs: write unmount record for ro mounts

There are dueling comments in the xfs code about intent
for log writes when unmounting a readonly filesystem.

In xfs_mountfs, we see the intent:

/*
* Now the log is fully replayed, we can transition to full read-only
* mode for read-only mounts. This will sync all the metadata and clean
* the log so that the recovery we just performed does not have to be
* replayed again on the next mount.
*/

and it calls xfs_quiesce_attr(), but by the time we get to
xfs_log_unmount_write(), it returns early for a RDONLY mount:

* Don't write out unmount record on read-only mounts.

Because of this, sequential ro mounts of a filesystem with
a dirty log will replay the log each time, which seems odd.

Fix this by writing an unmount record even for RO mounts, as long
as norecovery wasn't specified (don't write a clean log record
if a dirty log may still be there!) and the log device is
writable.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Orabug: 26630113

Link: https://patchwork.kernel.org/patch/9857169
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

xfs: fix eofblocks race with file extending async dio writes

It's possible for post-eof blocks to end up being used for direct I/O
writes. dio write performs an upfront unwritten extent allocation, sends
the dio and then updates the inode size (if necessary) on write
completion. If a file release occurs while a file extending dio write is
in flight, it is possible to mistake the post-eof blocks for speculative
preallocation and incorrectly truncate them from the inode. This means
that the resulting dio write completion can discover a hole and allocate
new blocks rather than perform unwritten extent conversion.

This requires a strange mix of I/O and is thus not likely to reproduce
in real world workloads. It is intermittently reproduced by generic/299.
The error manifests as an assert failure due to transaction overrun
because the aforementioned write completion transaction has only
reserved enough blocks for btree operations:

XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
file: fs/xfs//xfs_trans.c, line: 309

The root cause is that xfs_free_eofblocks() uses i_size to truncate
post-eof blocks from the inode, but async, file extending direct writes
do not update i_size until write completion, long after inode locks are
dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
to the incorrect size.

Update xfs_free_eofblocks() to serialize against dio similar to how
extending writes are serialized against i_size updates before post-eof
block zeroing. Specifically, wait on dio while under the iolock. This
ensures that dio write completions have updated i_size before post-eof
blocks are processed.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Orabug: 26588811

(backport upstream commit e4229d6b0bc9280f29624faf170cf76a9f1ca60e)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

NVMe: Allocate queues only for online cpus

The driver previously requested allocating queues for the total possible
number of CPUs so that blk-mq could rebalance these if CPUs were added
after initialization. The number of hardware contexts can now be changed
at runtime, so we only need to allocate the number of online queues
since we can add more later.

Suggested-by: Jeff Lien <jeff.lien@hgst.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 24675382
(cherry picked from commit 2800b8e7d9dfca1fd9d044dcf7a046b5de5a7239)
Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com>
Conflicts:
drivers/nvme/host/pci.c

Reviewed-by: Govinda Tatti<Govinda.Tatti@oracle.com>
Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

KEYS: fix dereferencing NULL payload with nonzero length

sys_add_key() and the KEYCTL_UPDATE operation of sys_keyctl() allowed a
NULL payload with nonzero length to be passed to the key type's
->preparse(), ->instantiate(), and/or ->update() methods. Various key
types including asymmetric, cifs.idmap, cifs.spnego, and pkcs7_test did
not handle this case, allowing an unprivileged user to trivially cause a
NULL pointer dereference (kernel oops) if one of these key types was
present. Fix it by doing the copy_from_user() when 'plen' is nonzero
rather than when '_payload' is non-NULL, causing the syscall to fail
with EFAULT as expected when an invalid buffer is specified.

Cc: stable@vger.kernel.org # 2.6.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Orabug: 26591890
(cherry picked from commit 5649645d725c73df4302428ee4e02c869248b4c5)
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>

IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets

In MLX qp packets, the LRH (built by the driver) has both a VL field
and an SL field. When building a QP1 packet, the VL field should
reflect the SLtoVL mapping and not arbitrarily contain zero (as is
done now). This bug causes credit problems in IB switches at
high rates of QP1 packets.

The fix is to cache the SL to VL mapping in the driver, and look up
the VL mapped to the SL provided in the send request when sending
QP1 packets.

For FW versions which support generating a port_management_config_change
event with subtype sl-to-vl-table-change, the driver uses that event
to update its sl-to-vl mapping cache.  Otherwise, the driver snoops
incoming SMP mads to update the cache.

There remains the case where the FW is running in secure-host mode
(so no QP0 packets are delivered to the driver), and the FW does not
generate the sl2vl mapping change event. To support this case, the
driver updates (via querying the FW) its sl2vl mapping cache when
running in secure-host mode when it receives either a Port Up event
or a client-reregister event (where the port is still up, but there
may have been an opensm failover).
OpenSM modifies the sl2vl mapping before Port Up and Client-reregister
events occur, so if there is a mapping change the driver's cache will
be properly updated.

Fixes: 225c7b1feef1 ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters")
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
[ Original fix ported from upstream commit fd10ed8 modified with
  changes from merge commit b9044ac and modifications in
  dump_dev_cap_flags2() likely to be made upstream for correctness
  (and verified with private communication with Mellanox) ]

Orabug: 26198210

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

Revert "IB/mlx4: Suppress warning for not handled portmgmt event subtype"

This reverts commit cef258494a818e96db0e4ed40ddd2c52e6084846.

Revert needed to add code for proper handling of the port
management change event in a subsequent commit (instead
of a workaround that suppresses an error message for
an unhandled event).

Orabug: 26198210

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>

scsi: megaraid_sas: handle dma_addr_t right on 32-bit

Orabug: 26608922

When building with a dma_addr_t that is different from pointer size, we
get this warning:

drivers/scsi/megaraid/megaraid_sas_fusion.c: In function 'megasas_make_prp_nvme':
drivers/scsi/megaraid/megaraid_sas_fusion.c:1654:17: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]

It's better to not pretend that the dma address is a pointer and instead
use a dma_addr_t consistently.

Fixes: 33203bc4d61b ("scsi: megaraid_sas: NVME fast path io support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Sumit Saxena <sumit.saxena@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d1da522fb8a70b8c527d4ad15f9e62218cc00f2c)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

scsi: megaraid_sas: NVME fast path io support

Orabug: 26608922

This patch provide true fast path IO support. Driver creates PRP for
NVME drives and send Fast Path for performance. Certain h/w requirement
needs to be taken care in driver.

Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 33203bc4d61b33f1f7bb736eac0c6fdd20b92397)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/scsi/megaraid/megaraid_sas.h
drivers/scsi/megaraid/megaraid_sas_fp.c

scsi: megaraid_sas: NVME interface target prop added

Orabug: 26608922

This patch fetch true values of NVME property from FW using New DCMD
interface MR_DCMD_DEV_GET_TARGET_PROP

Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 96188a89cc6d5ad3a0a5b7a6c4abc9f4a6de7678)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

scsi: megaraid_sas: NVME Interface detection and prop settings

Orabug: 26608922

Adding detection logic for NVME device attached behind Ventura
controller.  Driver set HostPageSize in IOC_INIT frame to inform about
page size for NVME devices.  Firmware reports NVME page size to the
driver.  PD INFO DCMD provide new interface type NVME_PD. Driver set
property of NVME device.

Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit 15dd03811d99dcf828f4eeb2c2b6a02ddc1201c7 ]
[ Resolved the use of blk_queue_virt_boundary just like broadcom driver version ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/scsi/megaraid/megaraid_sas.h

scsi: megaraid_sas: Use synchronize_irq to wait for IRQs to complete

Orabug: 26608922

FIX - Do not use random delay to synchronize with IRQ. Use kernel API.

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit 29206da1490a7065e8a03ec43f6de60c5c978cae ]
[ Added drivers/scsi/megaraid/kcompat.h to resolve pci_irq_vector symbol ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

fs/fuse: fuse mount can cause panic with no memory numa node

Orabug: 26591421

Host panics with the below stack trace:-

[ 2509.835362] BUG: unable to handle kernel paging request at
0000000000001e08
[ 2509.843155] IP: [<ffffffff8119e1a7>]
__alloc_pages_nodemask+0xb7/0xa10
[ 2509.850466] PGD 7f5bbf5067 PUD 7f5c492067 PMD 0
..
..
[ 2510.108923] Call Trace:
[ 2510.111658]  [<ffffffff81022075>] ? native_sched_clock+0x35/0xa0
[ 2510.118360]  [<ffffffff810220e9>] ? sched_clock+0x9/0x10
[ 2510.124291]  [<ffffffff810b9ad5>] ? sched_clock_cpu+0x85/0xc0
[ 2510.130692]  [<ffffffff810b6227>] ? try_to_wake_up+0x47/0x370
[ 2510.137108]  [<ffffffff811f0c43>] ? deactivate_slab+0x383/0x400
[ 2510.143712]  [<ffffffff811f1e37>] new_slab+0xa7/0x460
[ 2510.149350]  [<ffffffff81733cc0>] __slab_alloc+0x317/0x477
[ 2510.155473]  [<ffffffffa0380336>] ? fuse_conn_init+0x236/0x420 [fuse]
[ 2510.162664]  [<ffffffff8146184c>] ? extract_entropy+0x14c/0x330
[ 2510.169260]  [<ffffffff814622b0>] ? get_random_bytes+0x40/0xa0
[ 2510.175769]  [<ffffffff811f2c75>] kmem_cache_alloc_node_trace+0x95/0x2b0
[ 2510.183248]  [<ffffffffa0380336>] fuse_conn_init+0x236/0x420 [fuse]
[ 2510.190241]  [<ffffffffa03808c0>] fuse_fill_super+0x3a0/0x700 [fuse]
[ 2510.197330]  [<ffffffff811f30f6>] ? __kmalloc+0x266/0x2c0
[ 2510.203355]  [<ffffffff811a6dcc>] ? register_shrinker+0x3c/0xa0
[ 2510.209961]  [<ffffffff81216728>] ? sget+0x3b8/0x3f0
[ 2510.215500]  [<ffffffff81215880>] ? get_anon_bdev+0x120/0x120
[ 2510.221911]  [<ffffffffa0380520>] ? fuse_conn_init+0x420/0x420 [fuse]
[ 2510.229096]  [<ffffffff8121686d>] mount_nodev+0x4d/0xb0
[ 2510.234927]  [<ffffffffa037ec38>] fuse_mount+0x18/0x20 [fuse]
[ 2510.241337]  [<ffffffff812173e9>] mount_fs+0x39/0x180
[ 2510.246975]  [<ffffffff8123366b>] vfs_kern_mount+0x6b/0x110
[ 2510.253191]  [<ffffffff81236411>] do_mount+0x251/0xcf0
[ 2510.258922]  [<ffffffff812371f2>] SyS_mount+0xa2/0x110
[ 2510.264653]  [<ffffffff81028666>] ? syscall_trace_leave+0xc6/0x150
[ 2510.271551]  [<ffffffff8173d96e>] system_call_fastpath+0x12/0x71

This commit reverts the changes done in 937287c2e4ab ("fs/fuse: Fix for
correct number of numa nodes") and fixes the original problem by calling
kmalloc_node/kmem_cache_alloc_node, only on the online numa nodes. For the
offline nodes, kmalloc or kmem_cache_alloc will be used for allocation.
This problem happens only with slub allocator because it does not do
fallback allocation if node is offline, unlike slab. And for that reason,
the same behavior is emulated here.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Ashish Samant<ashish.samant@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>

Fix regression which breaks DFS mounting

Orabug: 26591404

Patch a6b5058 results in -EREMOTE returned by is_path_accessible() in
cifs_mount() to be ignored which breaks DFS mounting.

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
(cherry picked from commit d171356ff11ab1825e456dfb979755e01b3c54a1)
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

ol7/spec: sync up linux-firmware version for ol74

Orabug: 26567308
Orabug: 26567283

Because linux-firmware package was updated to version
20170803-56.git7d2c913d.0.1
We need to sync up the version required by QU5 and QU6
kernel.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

ol6/spec: sync up linux-firmware version for ol6

Orabug: 26586911
Orabug: 26586927

Because linux-firmware package was updated to version
20170803-56.git7d2c913d.0.1
We need to sync up the version required by QU5 and QU6
kernel.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>

scsi: scsi_debug: Avoid PI being disabled when TPGS is enabled

It was not possible to enable both T10 PI and TPGS because they share
the same byte in the INQUIRY response. Logically OR the TPGS value
instead of using assignment.

Orabug: 25704090

Reported-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 70bdf2026d905b8bfa0a455d35018df3e9777a6c)
Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Conflicts:
drivers/scsi/scsi_debug.c

config: enable QEDI config items for UEK

Orabug: 26519978

Enable QEDI driver in UEK config files

Signed-off-by: Brian Maly <brian.maly@oracle.com>

qedi: add firmware related dependencies

Orabug: 26519978

Add some firmware related code qedi driver needs to build properly

Signed-off-by: Brian Maly <brian.maly@oracle.com>

config: enable QED config items for UEK

Orabug: 26519978

Make QED related config items consistent between all x86 config files.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

scsi: qedi: fix KAPI for UEK kernels

Orabug: 26519978

There are some minor differences between our kernels KAPI and what
the vendors patchset expected. Fixup KAPI for UEK4 kernels.

Signed-off-by: Brian Maly <brian.maly@oracle.com>

nfsd: encoders mustn't use unitialized values in error cases

In error cases, lgp->lg_layout_type may be out of bounds; so we
shouldn't be using it until after the check of nfserr.

This was seen to crash nfsd threads when the server receives a LAYOUTGET
request with a large layout type.

GETDEVICEINFO has the same problem.

Reported-by: Ari Kauppi <Ari.Kauppi@synopsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit f961e3f2acae94b727380c0b74e2d3954d0edf79)

Orabug: 26376568
CVE: CVE-2017-8797

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>

nfsd: fix undefined behavior in nfsd4_layout_verify

UBSAN: Undefined behaviour in fs/nfsd/nfs4proc.c:1262:34
shift exponent 128 is too large for 32-bit type 'int'

Depending on compiler+architecture, this may cause the check for
layout_type to succeed for overly large values (which seems to be the
case with amd64). The large value will be later used in de-referencing
nfsd4_layout_ops for function pointers.

Reported-by: Jani Tuovila <tuovila@synopsys.com>
Signed-off-by: Ari Kauppi <ari@synopsys.com>
[colin.king@canonical.com: use LAYOUT_TYPE_MAX instead of 32]
Cc: stable@vger.kernel.org
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit b550a32e60a4941994b437a8d662432a486235a5)

Orabug: 26376568
CVE: CVE-2017-8797

Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com>
Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
fs/nfsd/nfs4proc.c

ovl: fix workdir creation

Workdir creation fails in latest kernel.

Fix by allowing EOPNOTSUPP as a valid return value from
vfs_removexattr(XATTR_NAME_POSIX_ACL_*). Upper filesystem may not support
ACL and still be perfectly able to support overlayfs.

Reported-by: Martin Ziegler <ziegler@uni-freiburg.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: c11b9fdd6a61 ("ovl: remove posix_acl_default from workdir")
Cc: <stable@vger.kernel.org>
Orabug: 26401569

(backport upstream commit e1ff3dd1ae52cef5b5373c8cc4ad949c2c25a71c)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: listxattr: use strnlen()

Be defensive about what underlying fs provides us in the returned xattr
list buffer. If it's not properly null terminated, bail out with a warning
insead of BUG.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Orabug: 26401569

(backport upstream commit 7cb35119d067191ce9ebc380a599db0b03cbd9d9)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: copyattr after setting POSIX ACL

Setting POSIX acl may also modify the file mode, so need to copy that up to
the overlay inode.

Reported-by: Eryu Guan <eguan@redhat.com>
Fixes: d837a49bd57f ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit ce31513a9114f74fe3e9caa6534d201bdac7238d)
call d_inode to get inode from dentry.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: don't cache acl on overlay layer

Some operations (setxattr/chmod) can make the cached acl stale.  We either
need to clear overlay's acl cache for the affected inode or prevent acl
caching on the overlay altogether.  Preventing caching has the following
advantages:

- no double caching, less memory used

- overlay cache doesn't go stale when fs clears it's own cache

Possible disadvantage is performance loss.  If that becomes a problem
get_acl() can be optimized for overlayfs.

This patch disables caching by pre setting i_*acl to a value that

  - has bit 0 set, so is_uncached_acl() will return true

  - is not equal to ACL_NOT_CACHED, so get_acl() will not overwrite it

The constant -3 was chosen for this purpose.

Fixes: 39a25b2b3762 ("ovl: define ->get_acl() for overlay inodes")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 2a3a2a3f35249412e35fbb48b743348c40373409)
use ACL_NOT_CACHED instead of ACL_DONT_CACHE since the is_uncached_acl
is not available in the current kernel.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: use cached acl on underlying layer

Instead of calling ->get_acl() directly, use get_acl() to get the cached
value.

We will have the acl cached on the underlying inode anyway, because we do
permission checking on the both the overlay and the underlying fs.

So, since we already have double caching, this improves performance without
any cost.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 5201dc449e4b6b6d7e92f7f974269b11681f98b5)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: remove posix_acl_default from workdir

Clear out posix acl xattrs on workdir and also reset the mode after
creation so that an inherited sgid bit is cleared.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Orabug: 26401569

(backport upstream commit c11b9fdd6a612f376a5e886505f1c54c16d8c380)
change inode_lock to mutex_lock to match the lock usage in the current
kernel.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: handle umask and posix_acl_default correctly on creation

Setting MS_POSIXACL in sb->s_flags has the side effect of passing mode to
create functions without masking against umask.

Another problem when creating over a whiteout is that the default posix acl
is not inherited from the parent dir (because the real parent dir at the
time of creation is the work directory).

Fix these problems by:

a) If upper fs does not have MS_POSIXACL, then mask mode with umask.

b) If creating over a whiteout, call posix_acl_create() to get the
inherited acls. After creation (but before moving to the final
destination) set these acls on the created file. posix_acl_create() also
updates the file creation mode as appropriate.

Fixes: 39a25b2b3762 ("ovl: define ->get_acl() for overlay inodes")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 38b256973ea90fc7c2b7e1b734fa0e8b83538d50)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: change inode_lock to mutex_lock

Orabug: 26401569

Change the lock type in order to back port the patch to the current
kernel which uses mutex_lock instead of inode_lock.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: proper cleanup of workdir

When mounting overlayfs it needs a clean "work" directory under the
supplied workdir.

Previously the mount code removed this directory if it already existed and
created a new one. If the removal failed (e.g. directory was not empty)
then it fell back to a read-only mount not using the workdir.

While this has never been reported, it is possible to get a non-empty
"work" dir from a previous mount of overlayfs in case of crash in the
middle of an operation using the work directory.

In this case the left over state should be discarded and the overlay
filesystem will be consistent, guaranteed by the atomicity of operations on
moving to/from the workdir to the upper layer.

This patch implements cleaning out any files left in workdir. It is
implemented using real recursion for simplicity, but the depth is limited
to 2, because the worst case is that of a directory containing whiteouts
under "work".

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>
Orabug: 26401569

(backport upstream commit eea2fb4851e9dcbab6b991aaf47e2e024f1f55a0)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: rename is_merge to is_lowest

The 'is_merge' is an historical naming from when only a single lower layer
could exist. With the introduction of multiple lower layers the meaning of
this flag was changed to mean only the "lowest layer" (while all lower
layers were being merged).

So now 'is_merge' is inaccurate and hence renaming to 'is_lowest'

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 56656e960b555cb98bc414382566dcb59aae99a2)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: cleanup unused var in rename2

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 6986c012faa480fb0fda74eaae9abb22f7ad1004)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: Switch to generic_getxattr

Now that overlayfs has xattr handlers for iop->{set,remove}xattr, use
those same handlers for iop->getxattr as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Orabug: 26401569

(backport upstream commit 0eb45fc3bb7a2cf9c9c93d9e95986a841e5f4625)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: Switch to generic_removexattr

Commit d837a49bd57f ("ovl: fix POSIX ACL setting") switches from
iop->setxattr from ovl_setxattr to generic_setxattr, so switch from
ovl_removexattr to generic_removexattr as well. As far as permission
checking goes, the same rules should apply in either case.

While doing that, rename ovl_setxattr to ovl_xattr_set to indicate that
this is not an iop->setxattr implementation and remove the unused inode
argument.

Move ovl_other_xattr_set above ovl_own_xattr_set so that they match the
order of handlers in ovl_xattr_handlers.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: d837a49bd57f ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Orabug: 26401569

(backport upstream commit 0e585ccc13b3edbb187fb4f1b7cc9397f17d64a9)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: Get rid of ovl_xattr_noacl_handlers array

Use an ordinary #ifdef to conditionally include the POSIX ACL handlers
in ovl_xattr_handlers, like the other filesystems do. Flag the code
that is now only used conditionally with __maybe_unused.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 0c97be22f928b85110504c4bbb8574facb4bd0c0)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: Fix OVL_XATTR_PREFIX

Make sure ovl_own_xattr_handler only matches attribute names starting
with "overlay.", not "overlayXXX".

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fixes: d837a49bd57f ("ovl: fix POSIX ACL setting")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Shan Hai <shan.hai@oracle.com>
Orabug: 26401569

(backport upstream commit fe2b75952347762a21f67d9df1199137ae5988b2)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: xattr filter fix

a) ovl_need_xattr_filter() is wrong, we can have multiple lower layers
overlaid, all of which (except the lowest one) honouring the
"trusted.overlay.opaque" xattr. So need to filter everything except the
bottom and the pure-upper layer.

b) we no longer can assume that inode is attached to dentry in
get/setxattr.

This patch unconditionally filters private xattrs to fix both of the above.
Performance impact for get/removexattrs is likely in the noise.

For listxattrs it might be measurable in pathological cases, but I very
much hope nobody cares. If they do, we'll fix it then.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: b96809173e94 ("security_d_instantiate(): move to the point prior to attaching dentry to inode")
Orabug: 26401569

(backport upstream commit b581755b1c565391c72d03b157ba2dd0b18e9d15)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix warnings caused by WRITE_ONCE

Orabug: 26401569

The commit 39b681f80(ovl: store real inode pointer in ->i_private)
adds a call to the WRITE_ONCE which generates "initialization makes pointer
from integer without a cast" warnings, fix it by coverting to the required type.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: simplify empty checking

The empty checking logic is duplicated in ovl_check_empty_and_clear() and
ovl_remove_and_whiteout(), except the condition for clearing whiteouts is
different:

ovl_check_empty_and_clear() checked for being upper

ovl_remove_and_whiteout() checked for merge OR lower

Move the intersection of those checks (upper AND merge) into
ovl_check_empty_and_clear() and simplify ovl_remove_and_whiteout().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 30c17ebfb2a11468fe825de19afa3934ee98bfd2)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

qstr: constify instances in overlayfs

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 29c42e80ba5b1e59a4f427b44e2bdebd347b9409)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: clear nlink on rmdir

To make delete notification work on fa/inotify.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit dbc816d05ddcfb189af8784d04fc84c812db3747)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: append MAY_READ when diluting write checks

Right now we remove MAY_WRITE/MAY_APPEND bits from mask if realfile is on
lower/. This is done as files on lower will never be written and will be
copied up. But to copy up a file, mounter should have MAY_READ permission
otherwise copy up will fail. So set MAY_READ in mask when MAY_WRITE is
reset.

Dan Walsh noticed this when he did access(lowerfile, W_OK) and it returned
True (context mounts) but when he tried to actually write to file, it
failed as mounter did not have permission on lower file.

[SzM] don't set MAY_READ if only MAY_APPEND is set without MAY_WRITE; this
won't trigger a copy-up.

Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 500cac3ccee65526d5075da3af2674101305bf8c)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: dilute permission checks on lower only if not special file

Right now if file is on lower/, we remove MAY_WRITE/MAY_APPEND bits from
mask as lower/ will never be written and file will be copied up. But this
is not true for special files. These files are not copied up and are opened
in place. So don't dilute the checks for these types of files.

Reported-by: Dan Walsh <dwalsh@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit e29841a0ab3d03e77313abd8fb4c16e80fb26e29)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix POSIX ACL setting

Setting POSIX ACL needs special handling:

1) Some permission checks are done by ->setxattr() which now uses mounter's
creds ("ovl: do operations on underlying file system in mounter's
context"). These permission checks need to be done with current cred as
well.

2) Setting ACL can fail for various reasons. We do not need to copy up in
these cases.

In the mean time switch to using generic_setxattr.

[Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr()
doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is
disabled, and I could not come up with an obvious way to do it.

This instead avoids the link error by defining two sets of ACL operations
and letting the compiler drop one of the two at compile time depending
on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code,
also leading to smaller code.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit d837a49bd57f1ec2f6411efa829fecc34002b110)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: share inode for hard link

Inode attributes are copied up to overlay inode (uid, gid, mode, atime,
mtime, ctime) so generic code using these fields works correcty. If a hard
link is created in overlayfs separate inodes are allocated for each link.
If chmod/chown/etc. is performed on one of the links then the inode
belonging to the other ones won't be updated.

This patch attempts to fix this by sharing inodes for hard links.

Use inode hash (with real inode pointer as a key) to make sure overlay
inodes are shared for hard links on upper. Hard links on lower are still
split (which is not user observable until the copy-up happens, see
Documentation/filesystems/overlayfs.txt under "Non-standard behavior").

The inode is only inserted in the hash if it is non-directoy and upper.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 51f7e52dc943468c6929fa0a82d4afac3c8e9636)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: store real inode pointer in ->i_private

To get from overlay inode to real inode we currently use 'struct
ovl_entry', which has lifetime connected to overlay dentry. This is okay,
since each overlay dentry had a new overlay inode allocated.

Following patch will break that assumption, so need to leave out ovl_entry.
This patch stores the real inode directly in i_private, with the lowest bit
used to indicate whether the inode is upper or lower.

Lifetime rules remain, using ovl_inode_real() must only be done while
caller holds ref on overlay dentry (and hence on real dentry), or within
RCU protected regions.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 39b681f8026c170a73972517269efc830db0d7ce)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: permission: return ECHILD instead of ENOENT

The error is due to RCU and is temporary.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit a999d7e161a085e30181d0a88f049bd92112e172)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: update atime on upper

Fix atime update logic in overlayfs.

This patch adds an i_op->update_time() handler to overlayfs inodes.  This
forwards atime updates to the upper layer only.  No atime updates are done
on lower layers.

Remove implicit atime updates to underlying files and directories with
O_NOATIME.  Remove explicit atime update in ovl_readlink().

Clear atime related mnt flags from cloned upper mount.  This means atime
updates are controlled purely by overlayfs mount options.

Reported-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit d719e8f268fa4f9944b24b60814da9017dfb7787)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: convert inode_lock to mutex_lock

Orabug: 26401569

This patch is a fix to the conflict of the upstream commit bb0d2b8ad29
(ovl: fix sgid on directory), convert the lock type to match with the
lock usage of the current kernel.

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix sgid on directory

When creating directory in workdir, the group/sgid inheritance from the
parent dir was omitted completely. Fix this by calling inode_init_owner()
on overlay inode and using the resulting uid/gid/mode to create the file.

Unfortunately the sgid bit can be stripped off due to umask, so need to
reset the mode in this case in workdir before moving the directory in
place.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit bb0d2b8ad29630b580ac903f989e704e23462357)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: simplify permission checking

The fact that we always do permission checking on the overlay inode and
clear MAY_WRITE for checking access to the lower inode allows cruft to be
removed from ovl_permission().

1) "default_permissions" option effectively did generic_permission() on the
overlay inode with i_mode, i_uid and i_gid updated from underlying
filesystem. This is what we do by default now. It did the update using
vfs_getattr() but that's only needed if the underlying filesystem can
change (which is not allowed). We may later introduce a "paranoia_mode"
that verifies that mode/uid/gid are not changed.

2) splitting out the IS_RDONLY() check from inode_permission() also becomes
unnecessary once we remove the MAY_WRITE from the lower inode check.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 9c630ebefeeee4363ffd29f2f9b18eddafc6479c)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: do not require mounter to have MAY_WRITE on lower

Now we have two levels of checks in ovl_permission(). overlay inode
is checked with the creds of task while underlying inode is checked
with the creds of mounter.

Looks like mounter does not have to have WRITE access to files on lower/.
So remove the MAY_WRITE from access mask for checks on underlying
lower inode.

This means task should still have the MAY_WRITE permission on lower
inode and mounter is not required to have MAY_WRITE.

It also solves the problem of read only NFS mounts being used as lower.
If __inode_permission(lower_inode, MAY_WRITE) is called on read only
NFS, it fails. By resetting MAY_WRITE, check succeeds and case of
read only NFS shold work with overlay without having to specify any
special mount options (default permission).

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 754f8cb72b42a3a6100d2bbb1cb885361a7310dd)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: do operations on underlying file system in mounter's context

Given we are now doing checks both on overlay inode as well underlying
inode, we should be able to do checks and operations on underlying file
system using mounter's context.

So modify all operations to do checks/operations on underlying dentry/inode
in the context of mounter.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 1175b6b8d96331676f1d436b089b965807f23b4a)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix uid/gid when creating over whiteout

Fix a regression when creating a file over a whiteout.  The new
file/directory needs to use the current fsuid/fsgid, not the ones from the
mounter's credentials.

The refcounting is a bit tricky: prepare_creds() sets an original refcount,
override_creds() gets one more, which revert_cred() drops.  So

  1) we need to expicitly put the mounter's credentials when overriding
     with the updated one

  2) we need to put the original ref to the updated creds (and this can
     safely be done before revert_creds(), since we'll still have the ref
     from override_creds()).

Reported-by: Stephen Smalley <sds@tycho.nsa.gov>
Fixes: 3fe6e52f0626 ("ovl: override creds with the ones from the superblock mounter")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit d0e13f5bbe4be7c8f27736fc40503dcec04b7de0)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: modify ovl_permission() to do checks on two inodes

Right now ovl_permission() calls __inode_permission(realinode), to do
permission checks on real inode and no checks are done on overlay inode.

Modify it to do checks both on overlay inode as well as underlying inode.
Checks on overlay inode will be done with the creds of calling task while
checks on underlying inode will be done with the creds of mounter.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit c0ca3d70e8d3cf81e2255a217f7ca402f5ed0862)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: define ->get_acl() for overlay inodes

Now we are planning to do DAC permission checks on overlay inode
itself. And to make it work, we will need to make sure we can get acls from
underlying inode. So define ->get_acl() for overlay inodes and this in turn
calls into underlying filesystem to get acls, if any.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 39a25b2b37629f65e5a1eba1b353d0b47687c2ca)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: move some common code in a function

ovl_create_upper() and ovl_create_over_whiteout() seem to be sharing some
common code which can be moved into a separate function. No functionality
change.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 72e48481815eeca72fc886b3be91301ad87d6aeb)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: store ovl_entry in inode->i_private for all inodes

Previously this was only done for directory inodes. Doing so for all
inodes makes for a nice cleanup in ovl_permission at zero cost.

Inodes are not shared for hard links on the overlay, so this works fine.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 58ed4e70f253d80ed72faba7873dc11603b398bc)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: use generic_delete_inode

No point in keeping overlay inodes around since they will never be reused.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit eead4f2dc4f851a3790c49850e96a1d155bf5451)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: check mounter creds on underlying lookup

The hash salting changes meant that we can no longer reuse the hash in the
overlay dentry to look up the underlying dentry.

Instead of lookup_hash(), use lookup_one_len_unlocked() and swith to
mounter's creds (like we do for all other operations later in the series).

Now the lookup_hash() export introduced in 4.6 by 3c9fe8cdff1b ("vfs: add
lookup_hash() helper") is unused and can possibly be removed; its
usefulness negated by the hash salting and the idea that mounter's creds
should be used on operations on underlying filesystems.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 8387ff2577eb ("vfs: make the string hashes salt the hash")
Orabug: 26401569

(backport upstream commit c1b2cc1a765aff4df7b22abe6b66014236f73eba)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: ignore permissions on underlying lookup

Generally permission checking is not necessary when overlayfs looks up a
dentry on one of the underlying layers, since search permission on base
directory was already checked in ovl_permission().

More specifically using lookup_one_len() causes a problem when the lower
directory lacks search permission for a specific user while the upper
directory does have search permission. Since lookups are cached, this
causes inconsistency in behavior: success depends on who did the first
lookup.

So instead use lookup_hash() which doesn't do the permission check.

Reported-by: Ignacy Gawędzki <ignacy.gawedzki@green-communications.fr>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 38b78a5f18584db6fa7441e0f4531b283b0e6725)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: override creds with the ones from the superblock mounter

In user namespace the whiteout creation fails with -EPERM because the
current process isn't capable(CAP_SYS_ADMIN) when setting xattr.

A simple reproducer:

$ mkdir upper lower work merged lower/dir
$ sudo mount -t overlay overlay -olowerdir=lower,upperdir=upper,workdir=work merged
$ unshare -m -p -f -U -r bash

Now as root in the user namespace:

\# touch merged/dir/{1,2,3} # this will force a copy up of lower/dir
\# rm -fR merged/*

This ends up failing with -EPERM after the files in dir has been
correctly deleted:

unlinkat(4, "2", 0)                     = 0
unlinkat(4, "1", 0)                     = 0
unlinkat(4, "3", 0)                     = 0
close(4)                                = 0
unlinkat(AT_FDCWD, "merged/dir", AT_REMOVEDIR) = -1 EPERM (Operation not
permitted)

Interestingly, if you don't place files in merged/dir you can remove it,
meaning if upper/dir does not exist, creating the char device file works
properly in that same location.

This patch uses ovl_sb_creator_cred() to get the cred struct from the
superblock mounter and override the old cred with these new ones so that
the whiteout creation is possible because overlay is wrong in assuming that
the creds it will get with prepare_creds will be in the initial user
namespace.  The old cap_raise game is removed in favor of just overriding
the old cred struct.

This patch also drops from ovl_copy_up_one() the following two lines:

override_cred->fsuid = stat->uid;
override_cred->fsgid = stat->gid;

This is because the correct uid and gid are taken directly with the stat
struct and correctly set with ovl_set_attr().

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569

(backport upstream commit 3fe6e52f062643676eb4518d68cee3bc1272091b)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix dentry leak for default_permissions

When using the 'default_permissions' mount option, ovl_permission() on
non-directories was missing a dput(alias), resulting in "BUG Dentry still
in use".

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 8d3095f4ad47 ("ovl: default permissions")
Cc: <stable@vger.kernel.org> # v4.5+
Orabug: 26401569

(backport upstream commit a4859d75944a726533ab86d24bb5ffd1b2b7d6cc)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

ovl: fix open in stacked overlay

If two overlayfs filesystems are stacked on top of each other, then we need
recursion in ovl_d_select_inode().

I guess d_backing_inode() is supposed to do that. But currently it doesn't
and that functionality is open coded in vfs_open(). This is now copied
into ovl_d_select_inode() to fix this regression.

Reported-by: Alban Crequy <alban.crequy@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay...")
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
Orabug: 26401569

(backport upstream commit 1c8a47df36d72ace8cf78eb6c228aa0f8027d3c2)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

nfsd: don't hold i_mutex over userspace upcalls

We need information about exports when crossing mountpoints during
lookup or NFSv4 readdir.  If we don't already have that information
cached, we may have to ask (and wait for) rpc.mountd.

In both cases we currently hold the i_mutex on the parent of the
directory we're asking rpc.mountd about.  We've seen situations where
rpc.mountd performs some operation on that directory that tries to take
the i_mutex again, resulting in deadlock.

With some care, we may be able to avoid that in rpc.mountd.  But it
seems better just to avoid holding a mutex while waiting on userspace.

It appears that lookup_one_len is pretty much the only operation that
needs the i_mutex.  So we could just drop the i_mutex elsewhere and do
something like

mutex_lock()
lookup_one_len()
mutex_unlock()

In many cases though the lookup would have been cached and not required
the i_mutex, so it's more efficient to create a lookup_one_len() variant
that only takes the i_mutex when necessary.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 26401569

(backport upstream commit bbddca8e8fac07ece3938e03526b5d00fa791a4c)

Signed-off-by: Shan Hai <shan.hai@oracle.com>
Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>

Revert "ixgbevf: get rid of custom busy polling code"

This reverts commit 1975e69c708706b84d9462ce7c0135d33310c28a. Performance regression,
because the net/core napi support is not present.

Orabug: 26494997
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>

Revert "ixgbe: get rid of custom busy polling code"

This reverts commit 9244251e4f45dc9a61dd094a5d7ba23bb0285a86. The core napi support is
not in place, we need to keep the driver support or performance suffers.

Orabug: 26494997
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>

ocfs2: fix deadlock caused by recursive locking in xattr

Orabug: 26427132

Another deadlock path caused by recursive locking is reported.  This
kind of issue was introduced since commit 743b5f1434f5 ("ocfs2: take
inode lock in ocfs2_iop_set/get_acl()").  Two deadlock paths have been
fixed by commit b891fa5024a9 ("ocfs2: fix deadlock issue when taking
inode lock at vfs entry points").  Yes, we intend to fix this kind of
case in incremental way, because it's hard to find out all possible
paths at once.

This one can be reproduced like this.  On node1, cp a large file from
home directory to ocfs2 mountpoint.  While on node2, run
setfacl/getfacl.  Both nodes will hang up there.  The backtraces:

On node1:
  __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
  ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
  ocfs2_write_begin+0x43/0x1a0 [ocfs2]
  generic_perform_write+0xa9/0x180
  __generic_file_write_iter+0x1aa/0x1d0
  ocfs2_file_write_iter+0x4f4/0xb40 [ocfs2]
  __vfs_write+0xc3/0x130
  vfs_write+0xb1/0x1a0
  SyS_write+0x46/0xa0

On node2:
  __ocfs2_cluster_lock.isra.39+0x357/0x740 [ocfs2]
  ocfs2_inode_lock_full_nested+0x17d/0x840 [ocfs2]
  ocfs2_xattr_set+0x12e/0xe80 [ocfs2]
  ocfs2_set_acl+0x22d/0x260 [ocfs2]
  ocfs2_iop_set_acl+0x65/0xb0 [ocfs2]
  set_posix_acl+0x75/0xb0
  posix_acl_xattr_set+0x49/0xa0
  __vfs_setxattr+0x69/0x80
  __vfs_setxattr_noperm+0x72/0x1a0
  vfs_setxattr+0xa7/0xb0
  setxattr+0x12d/0x190
  path_setxattr+0x9f/0xb0
  SyS_setxattr+0x14/0x20

Fix this one by using ocfs2_inode_{lock|unlock}_tracker, which is
exported by commit 439a36b8ef38 ("ocfs2/dlmglue: prepare tracking logic
to avoid recursive cluster lock").

Link: http://lkml.kernel.org/r/20170622014746.5815-1-zren@suse.com
Fixes: 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Eric Ren <zren@suse.com>
Reported-by: Thomas Voegtle <tv@lio96.de>
Tested-by: Thomas Voegtle <tv@lio96.de>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherrypicked from commit 8818efaaacb78c60a9d90c5705b6c99b75d7d442)
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>