David Matlack [Mon, 19 Dec 2016 21:58:25 +0000 (13:58 -0800)]
kvm: x86: reduce collisions in mmu_page_hash
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs), becomes imbalanced; with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.
The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadow. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.
hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):
When we leave the multicast group on expiration of a neighbor we
do not free the mcast structure. This results in a memory leak
that causes ib_dealloc_pd to fail and print a WARN_ON message
and backtrace.
When performing sendonly joins, we queue the packets that trigger
the join until the join completes. This may take on the order of
hundreds of milliseconds. It is easy to have many more than three
packets come in during that time. Expand the maximum queue depth
in order to try and prevent dropped packets during the time it
takes to join the multicast group.
Since IPoIB should, as much as possible, emulate how multicast
sends work on Ethernet for regular TCP/IP apps, there should be
no requirement to subscribe to a multicast group before your
sends are properly sent. However, due to the difference in how
multicast is handled on InfiniBand, we must join the appropriate
multicast group before we can send to it. Previously we tried
not to trigger the auto-create feature of the subnet manager when
doing this because we didn't have tracking of these sendonly
groups and the auto-creation might never get undone. The previous
patch added timing to these sendonly joins and allows us to
leave them after a reasonable idle expiration time. So supply
all of the information needed to auto-create group.
On neighbor expiration, check to see if the neighbor was actually a
sendonly multicast join, and if so, leave the multicast group as we
expire the neighbor.
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit bd99b2e05c4df2a428e5c9dd338289089d0e26df)
When limiting the argv/envp strings during exec to 1/4 of the stack limit,
the storage of the pointers to the strings was not included. This means
that an exec with huge numbers of tiny strings could eat 1/4 of the stack
limit in strings and then additional space would be later used by the
pointers to the strings.
For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721
single-byte strings would consume less than 2MB of stack, the max (8MB /
4) amount allowed, but the pointers to the strings would consume the
remaining additional stack space (1677721 * 4 == 6710884).
The result (1677721 + 6710884 == 8388605) would exhaust stack space
entirely. Controlling this stack exhaustion could result in
pathological behavior in setuid binaries (CVE-2017-1000365).
[akpm@linux-foundation.org: additional commenting from Kees] Fixes: b6a2fea39318 ("mm: variable length argument support") Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Qualys Security Advisory <qsa@qualys.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 98da7d08850fb8bdeb395d6368ed15753304aa0c) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Vincent Guittot [Thu, 8 Dec 2016 16:56:54 +0000 (17:56 +0100)]
sched/core: Use load_avg for selecting idlest group
find_idlest_group() only compares the runnable_load_avg when looking
for the least loaded group. But on fork intensive use case like
hackbench where tasks blocked quickly after the fork, this can lead to
selecting the same CPU instead of other CPUs, which have similar
runnable load but a lower load_avg.
When the runnable_load_avg of 2 CPUs are close, we now take into
account the amount of blocked load as a 2nd selection factor. There is
now 3 zones for the runnable_load of the rq:
- [0 .. (runnable_load - imbalance)]:
Select the new rq which has significantly less runnable_load
- [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
The runnable loads are close so we use load_avg to chose
between the 2 rq
- [(runnable_load + imbalance) .. ULONG_MAX]:
Keep the current rq which has significantly less runnable_load
The scale factor that is currently used for comparing runnable_load,
doesn't work well with small value. As an example, the use of a
scaling factor fails as soon as this_runnable_load == 0 because we
always select local rq even if min_runnable_load is only 1, which
doesn't really make sense because they are just the same. So instead
of scaling factor, we use an absolute margin for runnable_load to
detect CPUs with similar runnable_load and we keep using scaling
factor for blocked load.
For use case like hackbench, this enable the scheduler to select
different CPUs during the fork sequence and to spread tasks across the
system.
Tests have been done on a Hikey board (ARM based octo cores) for
several kernel. The result below gives min, max, avg and stdev values
of 18 runs with each configuration.
The patches depend on the "no missing update_rq_clock()" work.
take_dentry_name_snapshot() takes a safe snapshot of dentry name;
if the name is a short one, it gets copied into caller-supplied
structure, otherwise an extra reference to external name is grabbed
(those are never modified). In either case the pointer to stable
string is stored into the same structure.
dentry must be held by the caller of take_dentry_name_snapshot(),
but may be freely dropped afterwards - the snapshot will stay
until destroyed by release_dentry_name_snapshot().
Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
to pass down with event.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 49d31c2f389acfe83417083e1208422b4091cd9e) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Chuck Lever [Thu, 1 Jun 2017 16:03:38 +0000 (12:03 -0400)]
NFSv4.1: Use seqid returned by EXCHANGE_ID after state migration
Transparent State Migration copies a client's lease state from the
server where a filesystem used to reside to the server where it now
resides. When an NFSv4.1 client first contacts that destination
server, it uses EXCHANGE_ID to detect trunking relationships.
The lease that was copied there is returned to that client, but the
destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to
the client. This is because the lease was confirmed on the source
server (before it was copied).
Normally, when CONFIRMED_R is set, a client purges the lease and
creates a new one. However, that throws away the entire benefit of
Transparent State Migration.
Therefore, the client must use the contrived slot sequence value
returned by the destination server for its first CREATE_SESSION
operation after a Transparent State Migration.
Orabug: 25802443 Reported-by: Xuan Qi <xuan.qi@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
(hand picked mainline 838edb9 NFSv4.1: Use seqid returned ...) Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
The skb_copy_and_csum_bits(skb_prev, maxfraglen, data + transhdrlen,
fraggap, 0); is overwriting skb->head and skb_shared_info
Since we apparently detect this rare condition too late, move the
code earlier to even avoid allocating skb and risking crashes.
Once again, many thanks to Andrey and syzkaller team.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: <idaifish@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 232cd35d0804cc241eb887bb8d4d9b3b9881c64a) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/ipv6/ip6_output.c
CAI Qian <caiqian@redhat.com> pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.
mkdir /tmp/1 /tmp/2
mount --make-rshared /
for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done
Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.
As such CVE-2016-6213 was assigned.
Ian Kent <raven@themaw.net> described the situation for autofs users
as follows:
> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.
So I am setting the default number of mounts allowed per mount
namespace at 100,000. This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts. Which should be perfect to catch misconfigurations and
malfunctioning programs.
For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.
Tested-by: CAI Qian <caiqian@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
(cherry picked from commit d29216842a85c7970c536108e093963f02714498) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
fs/namespace.c
kernel/sysctl.c
Guillaume Nault [Fri, 18 Nov 2016 21:13:00 +0000 (22:13 +0100)]
l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()
Lock socket before checking the SOCK_ZAPPED flag in l2tp_ip6_bind().
Without lock, a concurrent call could modify the socket flags between
the sock_flag(sk, SOCK_ZAPPED) test and the lock_sock() call. This way,
a socket could be inserted twice in l2tp_ip6_bind_table. Releasing it
would then leave a stale pointer there, generating use-after-free
errors when walking through the list or modifying adjacent entries.
BUG: KASAN: use-after-free in l2tp_ip6_close+0x22e/0x290 at addr ffff8800081b0ed8
Write of size 8 by task syz-executor/10987
CPU: 0 PID: 10987 Comm: syz-executor Not tainted 4.8.0+ #39
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014 ffff880031d97838ffffffff829f835bffff88001b5a1640ffff8800081b0ec0 ffff8800081b15a0ffff8800081b6d20ffff880031d97860ffffffff8174d3cc ffff880031d978f0ffff8800081b0e80ffff88001b5a1640ffff880031d978e0
Call Trace:
[<ffffffff829f835b>] dump_stack+0xb3/0x118 lib/dump_stack.c:15
[<ffffffff8174d3cc>] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
[< inline >] print_address_description mm/kasan/report.c:194
[<ffffffff8174d666>] kasan_report_error+0x1f6/0x4d0 mm/kasan/report.c:283
[< inline >] kasan_report mm/kasan/report.c:303
[<ffffffff8174db7e>] __asan_report_store8_noabort+0x3e/0x40 mm/kasan/report.c:329
[< inline >] __write_once_size ./include/linux/compiler.h:249
[< inline >] __hlist_del ./include/linux/list.h:622
[< inline >] hlist_del_init ./include/linux/list.h:637
[<ffffffff8579047e>] l2tp_ip6_close+0x22e/0x290 net/l2tp/l2tp_ip6.c:239
[<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
[<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
[<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
[<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
[<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
[<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
[<ffffffff813774f9>] task_work_run+0xf9/0x170
[<ffffffff81324aae>] do_exit+0x85e/0x2a00
[<ffffffff81326dc8>] do_group_exit+0x108/0x330
[<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
[<ffffffff811b49af>] do_signal+0x7f/0x18f0
[<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
[< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
[<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Object at ffff8800081b0ec0, in cache L2TP/IPv6 size: 1448
Allocated:
PID = 10987
[ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
[ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
[ 1116.897025] [<ffffffff8174c9ad>] kasan_kmalloc+0xad/0xe0
[ 1116.897025] [<ffffffff8174cee2>] kasan_slab_alloc+0x12/0x20
[ 1116.897025] [< inline >] slab_post_alloc_hook mm/slab.h:417
[ 1116.897025] [< inline >] slab_alloc_node mm/slub.c:2708
[ 1116.897025] [< inline >] slab_alloc mm/slub.c:2716
[ 1116.897025] [<ffffffff817476a8>] kmem_cache_alloc+0xc8/0x2b0 mm/slub.c:2721
[ 1116.897025] [<ffffffff84c4f6a9>] sk_prot_alloc+0x69/0x2b0 net/core/sock.c:1326
[ 1116.897025] [<ffffffff84c58ac8>] sk_alloc+0x38/0xae0 net/core/sock.c:1388
[ 1116.897025] [<ffffffff851ddf67>] inet6_create+0x2d7/0x1000 net/ipv6/af_inet6.c:182
[ 1116.897025] [<ffffffff84c4af7b>] __sock_create+0x37b/0x640 net/socket.c:1153
[ 1116.897025] [< inline >] sock_create net/socket.c:1193
[ 1116.897025] [< inline >] SYSC_socket net/socket.c:1223
[ 1116.897025] [<ffffffff84c4b46f>] SyS_socket+0xef/0x1b0 net/socket.c:1203
[ 1116.897025] [<ffffffff85e4d685>] entry_SYSCALL_64_fastpath+0x23/0xc6
Freed:
PID = 10987
[ 1116.897025] [<ffffffff811ddcb6>] save_stack_trace+0x16/0x20
[ 1116.897025] [<ffffffff8174c736>] save_stack+0x46/0xd0
[ 1116.897025] [<ffffffff8174cf61>] kasan_slab_free+0x71/0xb0
[ 1116.897025] [< inline >] slab_free_hook mm/slub.c:1352
[ 1116.897025] [< inline >] slab_free_freelist_hook mm/slub.c:1374
[ 1116.897025] [< inline >] slab_free mm/slub.c:2951
[ 1116.897025] [<ffffffff81748b28>] kmem_cache_free+0xc8/0x330 mm/slub.c:2973
[ 1116.897025] [< inline >] sk_prot_free net/core/sock.c:1369
[ 1116.897025] [<ffffffff84c541eb>] __sk_destruct+0x32b/0x4f0 net/core/sock.c:1444
[ 1116.897025] [<ffffffff84c5aca4>] sk_destruct+0x44/0x80 net/core/sock.c:1452
[ 1116.897025] [<ffffffff84c5ad33>] __sk_free+0x53/0x220 net/core/sock.c:1460
[ 1116.897025] [<ffffffff84c5af23>] sk_free+0x23/0x30 net/core/sock.c:1471
[ 1116.897025] [<ffffffff84c5cb6c>] sk_common_release+0x28c/0x3e0 ./include/net/sock.h:1589
[ 1116.897025] [<ffffffff8579044e>] l2tp_ip6_close+0x1fe/0x290 net/l2tp/l2tp_ip6.c:243
[ 1116.897025] [<ffffffff850b2dfd>] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
[ 1116.897025] [<ffffffff851dc5a0>] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
[ 1116.897025] [<ffffffff84c4581d>] sock_release+0x8d/0x1d0 net/socket.c:570
[ 1116.897025] [<ffffffff84c45976>] sock_close+0x16/0x20 net/socket.c:1017
[ 1116.897025] [<ffffffff817a108c>] __fput+0x28c/0x780 fs/file_table.c:208
[ 1116.897025] [<ffffffff817a1605>] ____fput+0x15/0x20 fs/file_table.c:244
[ 1116.897025] [<ffffffff813774f9>] task_work_run+0xf9/0x170
[ 1116.897025] [<ffffffff81324aae>] do_exit+0x85e/0x2a00
[ 1116.897025] [<ffffffff81326dc8>] do_group_exit+0x108/0x330
[ 1116.897025] [<ffffffff81348cf7>] get_signal+0x617/0x17a0 kernel/signal.c:2307
[ 1116.897025] [<ffffffff811b49af>] do_signal+0x7f/0x18f0
[ 1116.897025] [<ffffffff810039bf>] exit_to_usermode_loop+0xbf/0x150 arch/x86/entry/common.c:156
[ 1116.897025] [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
[ 1116.897025] [<ffffffff81006060>] syscall_return_slowpath+0x1a0/0x1e0 arch/x86/entry/common.c:259
[ 1116.897025] [<ffffffff85e4d726>] entry_SYSCALL_64_fastpath+0xc4/0xc6
Memory state around the buggy address: ffff8800081b0d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8800081b0e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8800081b0e80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
^ ffff8800081b0f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8800081b0f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
David Howells [Tue, 18 Apr 2017 14:31:07 +0000 (15:31 +0100)]
KEYS: Disallow keyrings beginning with '.' to be joined as session keyrings
This fixes CVE-2016-9604.
Keyrings whose name begin with a '.' are special internal keyrings and so
userspace isn't allowed to create keyrings by this name to prevent
shadowing. However, the patch that added the guard didn't fix
KEYCTL_JOIN_SESSION_KEYRING. Not only can that create dot-named keyrings,
it can also subscribe to them as a session keyring if they grant SEARCH
permission to the user.
This, for example, allows a root process to set .builtin_trusted_keys as
its session keyring, at which point it has full access because now the
possessor permissions are added. This permits root to add extra public
keys, thereby bypassing module verification.
This also affects kexec and IMA.
This can be tested by (as root):
keyctl session .builtin_trusted_keys
keyctl add user a a @s
keyctl list @s
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Dumazet [Wed, 17 May 2017 14:16:40 +0000 (07:16 -0700)]
sctp: do not inherit ipv6_{mc|ac|fl}_list from parent
SCTP needs fixes similar to 83eaddab4378 ("ipv6/dccp: do not inherit
ipv6_mc_list from parent"), otherwise bad things can happen.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit fdcee2cbb8438702ea1b328fb6e0ac5e9a40c7f8)
Herbert Xu [Mon, 21 Nov 2016 07:34:00 +0000 (15:34 +0800)]
crypto: algif_hash - Fix result clobbering in recvmsg
Recently an init call was added to hash_recvmsg so as to reset
the hash state in case a sendmsg call was never made.
Unfortunately this ended up clobbering the result if the previous
sendmsg was done with a MSG_MORE flag. This patch fixes it by
excluding that case when we make the init call.
Fixes: a8348bca2944 ("algif_hash - Fix NULL hash crash with shash") Reported-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 8acf7a106326eb94e143552de81f34308149121c)
Herbert Xu [Thu, 17 Nov 2016 14:07:58 +0000 (22:07 +0800)]
crypto: algif_hash - Fix NULL hash crash with shash
Recently algif_hash has been changed to allow null hashes. This
triggers a bug when used with an shash algorithm whereby it will
cause a crash during the digest operation.
This patch fixes it by avoiding the digest operation and instead
doing an init followed by a final which avoids the buggy code in
shash.
This patch also ensures that the result buffer is freed after an
error so that it is not returned as a genuine hash result on the
next recv call.
The shash/ahash wrapper code will be fixed later to handle this
case correctly.
Fixes: 493b2ed3f760 ("crypto: algif_hash - Handle NULL hashes correctly") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Laura Abbott <labbott@redhat.com>
(cherry picked from commit a8348bca2944d397a528772f5c0ccb47a8b58af4)
Herbert Xu [Thu, 1 Sep 2016 09:16:44 +0000 (17:16 +0800)]
crypto: algif_hash - Handle NULL hashes correctly
Right now attempting to read an empty hash simply returns zeroed
bytes, this patch corrects this by calling the digest function
using an empty input.
Reported-by: Russell King - ARM Linux <linux@armlinux.org.uk> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 493b2ed3f7603a15ff738553384d5a4510ffeb95)
Thomas Gleixner [Sun, 2 Aug 2015 20:38:23 +0000 (20:38 +0000)]
x86/irq: Protect smp_cleanup_move
smp_cleanup_move fiddles without protection in the interrupt
descriptors and the vector array. A concurrent irq setup/teardown or
affinity setting can pull the rug under that operation.
Mitch Williams [Tue, 29 Sep 2015 00:31:26 +0000 (17:31 -0700)]
i40e/i40evf: check for stopped admin queue
It's possible that while we are waiting for the spinlock, another
entity (that owns the spinlock) has shut down the admin queue.
If we then attempt to use the queue, we will panic.
Add a check for this condition on the receive side. This matches
an existing check on the send queue side.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 26654196
(cherry picked from commit 43ae93a93e8c95c5e6389dc8e11704712b1ab2e9) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Brian Maly <brian.maly@oracle.com>
When using the same file as the source and destination for a dedup
(extent_same ioctl) operation we were allowing it to dedup to a
destination offset beyond the file's size, which doesn't make sense and
it's not allowed for the case where the source and destination files are
not the same file. This made de deduplication operation successful only
when the source range corresponded to a hole, a prealloc extent or an
extent with all bytes having a value of 0x00. This was also leaving a
file hole (between i_size and destination offset) without the
corresponding file extent items, which can be reproduced with the
following steps for example:
$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
wrote 404990/404990 bytes at offset 304457
395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)
$ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
Deduping 2 total files
(28672, 24576): /mnt/sdi/foobar
(929792, 24576): /mnt/sdi/foobar
1 files asked to be deduped
i: 0, status: 0, bytes_deduped: 24576
24576 total bytes deduped in this operation
$ umount /mnt/sdi
$ btrfsck /dev/sdi
Checking filesystem on /dev/sdi
UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
checking extents
checking free space cache
checking fs roots
root 5 inode 257 errors 100, file extent discount
Found file extent holes:
start: 712704, len: 217088
found 540673 bytes used err is 1
total csum bytes: 400
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123675
file data blocks allocated: 671744
referenced 671744
btrfs-progs v4.2.3
So fix this by not allowing the destination to go beyond the file's size,
just as we do for the same where the source and destination files are not
the same.
A test for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
Orabug: 26441487
(cherry picked from commit f4dfe6871006c62abdccc77b2818b11f376e98e2) Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Mark Fasheh [Tue, 30 Jun 2015 21:42:06 +0000 (14:42 -0700)]
btrfs: fix clone / extent-same deadlocks
Clone and extent same lock their source and target inodes in opposite order.
In addition to this, the range locking in clone doesn't take ordering into
account. Fix this by having clone use the same locking helpers as
btrfs-extent-same.
In addition, I do a small cleanup of the locking helpers, removing a case
(both inodes being the same) which was poorly accounted for and never
actually used by the callers.
Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit 293a8489f300536dc6d996c35a6ebb89aa03bab2)
Mark Fasheh [Tue, 30 Jun 2015 21:42:08 +0000 (14:42 -0700)]
btrfs: don't update mtime/ctime on deduped inodes
One issue users have reported is that dedupe changes mtime on files,
resulting in tools like rsync thinking that their contents have changed when
in fact the data is exactly the same. We also skip the ctime update as no
user-visible metadata changes here and we want dedupe to be transparent to
the user.
Clone still wants time changes, so we special case this in the code.
Mark Fasheh [Tue, 30 Jun 2015 21:42:07 +0000 (14:42 -0700)]
btrfs: allow dedupe of same inode
clone() supports cloning within an inode so extent-same can do
the same now. This patch fixes up the locking in extent-same to
know about the single-inode case. In addition to that, we add a
check for overlapping ranges, which clone does not allow.
Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit 0efa9f48c7e6c15e75946dd2b1c82d3d19e13545)
Mark Fasheh [Tue, 30 Jun 2015 21:42:05 +0000 (14:42 -0700)]
btrfs: fix deadlock with extent-same and readpage
->readpage() does page_lock() before extent_lock(), we do the opposite in
extent-same. We want to reverse the order in btrfs_extent_same() but it's
not quite straightforward since the page locks are taken inside btrfs_cmp_data().
So I split btrfs_cmp_data() into 3 parts with a small context structure that
is passed between them. The first, btrfs_cmp_data_prepare() gathers up the
pages needed (taking page lock as required) and puts them on our context
structure. At this point, we are safe to lock the extent range. Afterwards,
we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free()
to clean up our context.
Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit f441460202cb787c49963bcc1f54cb48c52f7512)
Mark Fasheh [Tue, 30 Jun 2015 21:42:04 +0000 (14:42 -0700)]
btrfs: pass unaligned length to btrfs_cmp_data()
In the case that we dedupe the tail of a file, we might expand the dedupe
len out to the end of our last block. We don't want to compare data past
i_size however, so pass the original length to btrfs_cmp_data().
Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
(Cherry picked from commit 207910ddeeda38fd54544d94f8c8ca5a9632cc25)
The current test for bio vec merging is not fully accurate and can be
tricked into merging bios when certain grant combinations are used.
The result of these malicious bio merges is a bio that extends past
the memory page used by any of the originating bios.
Take into account the following scenario, where a guest creates two
grant references that point to the same mfn, ie: grant 1 -> mfn A,
grant 2 -> mfn A.
These references are then used in a PV block request, and mapped by
the backend domain, thus obtaining two different pfns that point to
the same mfn, pfn B -> mfn A, pfn C -> mfn A.
If those grants happen to be used in two consecutive sectors of a disk
IO operation becoming two different bios in the backend domain, the
checks in xen_biovec_phys_mergeable will succeed, because bfn1 == bfn2
(they both point to the same mfn). However due to the bio merging,
the backend domain will end up with a bio that expands past mfn A into
mfn A + 1.
Fix this by making sure the check in xen_biovec_phys_mergeable takes
into account the offset and the length of the bio, this basically
replicates whats done in __BIOVEC_PHYS_MERGEABLE using mfns (bus
addresses). While there also remove the usage of
__BIOVEC_PHYS_MERGEABLE, since that's already checked by the callers
of xen_biovec_phys_mergeable.
The upper dentry may become stale before we call ovl_lock_rename_workdir.
For example, someone could (mistakenly or maliciously) manually unlink(2)
it directly from upperdir.
To ensure it is not stale, let's lookup it after ovl_lock_rename_workdir
and and check if it matches the upper dentry.
Essentially, it is the same problem and similar solution as in
commit 11f3710417d0 ("ovl: verify upper dentry before unlink and rename").
Open the lower file with O_LARGEFILE in ovl_copy_up().
Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for
catching 32-bit userspace dealing with a file large enough that it'll be
mishandled if the application isn't aware that there might be an integer
overflow. Inside the kernel, there shouldn't be any problems.
Eric Sandeen [Fri, 21 Jul 2017 15:25:35 +0000 (10:25 -0500)]
xfs: toggle readonly state around xfs_log_mount_finish
When we do log recovery on a readonly mount, unlinked inode
processing does not happen due to the readonly checks in
xfs_inactive(), which are trying to prevent any I/O on a
readonly mount.
This is misguided - we do I/O on readonly mounts all the time,
for consistency; for example, log recovery. So do the same
RDONLY flag twiddling around xfs_log_mount_finish() as we
do around xfs_log_mount(), for the same reason.
This all cries out for a big rework but for now this is a
simple fix to an obvious problem.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Orabug: 26630113
Eric Sandeen [Fri, 21 Jul 2017 15:24:28 +0000 (10:24 -0500)]
xfs: write unmount record for ro mounts
There are dueling comments in the xfs code about intent
for log writes when unmounting a readonly filesystem.
In xfs_mountfs, we see the intent:
/*
* Now the log is fully replayed, we can transition to full read-only
* mode for read-only mounts. This will sync all the metadata and clean
* the log so that the recovery we just performed does not have to be
* replayed again on the next mount.
*/
and it calls xfs_quiesce_attr(), but by the time we get to
xfs_log_unmount_write(), it returns early for a RDONLY mount:
* Don't write out unmount record on read-only mounts.
Because of this, sequential ro mounts of a filesystem with
a dirty log will replay the log each time, which seems odd.
Fix this by writing an unmount record even for RO mounts, as long
as norecovery wasn't specified (don't write a clean log record
if a dirty log may still be there!) and the log device is
writable.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
Orabug: 26630113
Brian Foster [Sat, 28 Jan 2017 07:22:57 +0000 (23:22 -0800)]
xfs: fix eofblocks race with file extending async dio writes
It's possible for post-eof blocks to end up being used for direct I/O
writes. dio write performs an upfront unwritten extent allocation, sends
the dio and then updates the inode size (if necessary) on write
completion. If a file release occurs while a file extending dio write is
in flight, it is possible to mistake the post-eof blocks for speculative
preallocation and incorrectly truncate them from the inode. This means
that the resulting dio write completion can discover a hole and allocate
new blocks rather than perform unwritten extent conversion.
This requires a strange mix of I/O and is thus not likely to reproduce
in real world workloads. It is intermittently reproduced by generic/299.
The error manifests as an assert failure due to transaction overrun
because the aforementioned write completion transaction has only
reserved enough blocks for btree operations:
The root cause is that xfs_free_eofblocks() uses i_size to truncate
post-eof blocks from the inode, but async, file extending direct writes
do not update i_size until write completion, long after inode locks are
dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
to the incorrect size.
Update xfs_free_eofblocks() to serialize against dio similar to how
extending writes are serialized against i_size updates before post-eof
block zeroing. Specifically, wait on dio while under the iolock. This
ensures that dio write completions have updated i_size before post-eof
blocks are processed.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Orabug: 26588811
Keith Busch [Fri, 13 May 2016 18:38:09 +0000 (12:38 -0600)]
NVMe: Allocate queues only for online cpus
The driver previously requested allocating queues for the total possible
number of CPUs so that blk-mq could rebalance these if CPUs were added
after initialization. The number of hardware contexts can now be changed
at runtime, so we only need to allocate the number of online queues
since we can add more later.
Suggested-by: Jeff Lien <jeff.lien@hgst.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 24675382
(cherry picked from commit 2800b8e7d9dfca1fd9d044dcf7a046b5de5a7239) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com>
Conflicts:
drivers/nvme/host/pci.c
Reviewed-by: Govinda Tatti<Govinda.Tatti@oracle.com> Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Eric Biggers [Thu, 8 Jun 2017 13:48:40 +0000 (14:48 +0100)]
KEYS: fix dereferencing NULL payload with nonzero length
sys_add_key() and the KEYCTL_UPDATE operation of sys_keyctl() allowed a
NULL payload with nonzero length to be passed to the key type's
->preparse(), ->instantiate(), and/or ->update() methods. Various key
types including asymmetric, cifs.idmap, cifs.spnego, and pkcs7_test did
not handle this case, allowing an unprivileged user to trivially cause a
NULL pointer dereference (kernel oops) if one of these key types was
present. Fix it by doing the copy_from_user() when 'plen' is nonzero
rather than when '_payload' is non-NULL, causing the syscall to fail
with EFAULT as expected when an invalid buffer is specified.
Cc: stable@vger.kernel.org # 2.6.10+ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
Orabug: 26591890
(cherry picked from commit 5649645d725c73df4302428ee4e02c869248b4c5) Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Jack Morgenstein [Mon, 12 Sep 2016 16:16:21 +0000 (19:16 +0300)]
IB/mlx4: Fix possible vl/sl field mismatch in LRH header in QP1 packets
In MLX qp packets, the LRH (built by the driver) has both a VL field
and an SL field. When building a QP1 packet, the VL field should
reflect the SLtoVL mapping and not arbitrarily contain zero (as is
done now). This bug causes credit problems in IB switches at
high rates of QP1 packets.
The fix is to cache the SL to VL mapping in the driver, and look up
the VL mapped to the SL provided in the send request when sending
QP1 packets.
For FW versions which support generating a port_management_config_change
event with subtype sl-to-vl-table-change, the driver uses that event
to update its sl-to-vl mapping cache. Otherwise, the driver snoops
incoming SMP mads to update the cache.
There remains the case where the FW is running in secure-host mode
(so no QP0 packets are delivered to the driver), and the FW does not
generate the sl2vl mapping change event. To support this case, the
driver updates (via querying the FW) its sl2vl mapping cache when
running in secure-host mode when it receives either a Port Up event
or a client-reregister event (where the port is still up, but there
may have been an opensm failover).
OpenSM modifies the sl2vl mapping before Port Up and Client-reregister
events occur, so if there is a mapping change the driver's cache will
be properly updated.
Fixes: 225c7b1feef1 ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters") Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
[ Original fix ported from upstream commit fd10ed8 modified with
changes from merge commit b9044ac and modifications in
dump_dev_cap_flags2() likely to be made upstream for correctness
(and verified with private communication with Mellanox) ]
Revert needed to add code for proper handling of the port
management change event in a subsequent commit (instead
of a workaround that suppresses an error message for
an unhandled event).
When building with a dma_addr_t that is different from pointer size, we
get this warning:
drivers/scsi/megaraid/megaraid_sas_fusion.c: In function 'megasas_make_prp_nvme':
drivers/scsi/megaraid/megaraid_sas_fusion.c:1654:17: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
It's better to not pretend that the dma address is a pointer and instead
use a dma_addr_t consistently.
Fixes: 33203bc4d61b ("scsi: megaraid_sas: NVME fast path io support") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Sumit Saxena <sumit.saxena@broadcom.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d1da522fb8a70b8c527d4ad15f9e62218cc00f2c) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
This patch provide true fast path IO support. Driver creates PRP for
NVME drives and send Fast Path for performance. Certain h/w requirement
needs to be taken care in driver.
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 33203bc4d61b33f1f7bb736eac0c6fdd20b92397) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/scsi/megaraid/megaraid_sas.h
drivers/scsi/megaraid/megaraid_sas_fp.c
This patch fetch true values of NVME property from FW using New DCMD
interface MR_DCMD_DEV_GET_TARGET_PROP
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 96188a89cc6d5ad3a0a5b7a6c4abc9f4a6de7678) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Adding detection logic for NVME device attached behind Ventura
controller. Driver set HostPageSize in IOC_INIT frame to inform about
page size for NVME devices. Firmware reports NVME page size to the
driver. PD INFO DCMD provide new interface type NVME_PD. Driver set
property of NVME device.
Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@broadcom.com> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
[ Upstream commit 15dd03811d99dcf828f4eeb2c2b6a02ddc1201c7 ]
[ Resolved the use of blk_queue_virt_boundary just like broadcom driver version ] Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Conflicts:
drivers/scsi/megaraid/megaraid_sas.h
This commit reverts the changes done in 937287c2e4ab ("fs/fuse: Fix for
correct number of numa nodes") and fixes the original problem by calling
kmalloc_node/kmem_cache_alloc_node, only on the online numa nodes. For the
offline nodes, kmalloc or kmem_cache_alloc will be used for allocation.
This problem happens only with slub allocator because it does not do
fallback allocation if node is offline, unlike slab. And for that reason,
the same behavior is emulated here.
Martin K. Petersen [Fri, 19 May 2017 16:39:36 +0000 (12:39 -0400)]
scsi: scsi_debug: Avoid PI being disabled when TPGS is enabled
It was not possible to enable both T10 PI and TPGS because they share
the same byte in the INQUIRY response. Logically OR the TPGS value
instead of using assignment.
J. Bruce Fields [Fri, 5 May 2017 20:17:57 +0000 (16:17 -0400)]
nfsd: encoders mustn't use unitialized values in error cases
In error cases, lgp->lg_layout_type may be out of bounds; so we
shouldn't be using it until after the check of nfserr.
This was seen to crash nfsd threads when the server receives a LAYOUTGET
request with a large layout type.
GETDEVICEINFO has the same problem.
Reported-by: Ari Kauppi <Ari.Kauppi@synopsys.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit f961e3f2acae94b727380c0b74e2d3954d0edf79)
Ari Kauppi [Fri, 5 May 2017 20:07:55 +0000 (16:07 -0400)]
nfsd: fix undefined behavior in nfsd4_layout_verify
UBSAN: Undefined behaviour in fs/nfsd/nfs4proc.c:1262:34
shift exponent 128 is too large for 32-bit type 'int'
Depending on compiler+architecture, this may cause the check for
layout_type to succeed for overly large values (which seems to be the
case with amd64). The large value will be later used in de-referencing
nfsd4_layout_ops for function pointers.
Reported-by: Jani Tuovila <tuovila@synopsys.com> Signed-off-by: Ari Kauppi <ari@synopsys.com>
[colin.king@canonical.com: use LAYOUT_TYPE_MAX instead of 32] Cc: stable@vger.kernel.org Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit b550a32e60a4941994b437a8d662432a486235a5)
Fix by allowing EOPNOTSUPP as a valid return value from
vfs_removexattr(XATTR_NAME_POSIX_ACL_*). Upper filesystem may not support
ACL and still be perfectly able to support overlayfs.
Reported-by: Martin Ziegler <ziegler@uni-freiburg.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: c11b9fdd6a61 ("ovl: remove posix_acl_default from workdir") Cc: <stable@vger.kernel.org>
Orabug: 26401569
Be defensive about what underlying fs provides us in the returned xattr
list buffer. If it's not properly null terminated, bail out with a warning
insead of BUG.
Some operations (setxattr/chmod) can make the cached acl stale. We either
need to clear overlay's acl cache for the affected inode or prevent acl
caching on the overlay altogether. Preventing caching has the following
advantages:
- no double caching, less memory used
- overlay cache doesn't go stale when fs clears it's own cache
Possible disadvantage is performance loss. If that becomes a problem
get_acl() can be optimized for overlayfs.
This patch disables caching by pre setting i_*acl to a value that
- has bit 0 set, so is_uncached_acl() will return true
- is not equal to ACL_NOT_CACHED, so get_acl() will not overwrite it
(backport upstream commit 2a3a2a3f35249412e35fbb48b743348c40373409)
use ACL_NOT_CACHED instead of ACL_DONT_CACHE since the is_uncached_acl
is not available in the current kernel.
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
ovl: handle umask and posix_acl_default correctly on creation
Setting MS_POSIXACL in sb->s_flags has the side effect of passing mode to
create functions without masking against umask.
Another problem when creating over a whiteout is that the default posix acl
is not inherited from the parent dir (because the real parent dir at the
time of creation is the work directory).
Fix these problems by:
a) If upper fs does not have MS_POSIXACL, then mask mode with umask.
b) If creating over a whiteout, call posix_acl_create() to get the
inherited acls. After creation (but before moving to the final
destination) set these acls on the created file. posix_acl_create() also
updates the file creation mode as appropriate.
When mounting overlayfs it needs a clean "work" directory under the
supplied workdir.
Previously the mount code removed this directory if it already existed and
created a new one. If the removal failed (e.g. directory was not empty)
then it fell back to a read-only mount not using the workdir.
While this has never been reported, it is possible to get a non-empty
"work" dir from a previous mount of overlayfs in case of crash in the
middle of an operation using the work directory.
In this case the left over state should be discarded and the overlay
filesystem will be consistent, guaranteed by the atomicity of operations on
moving to/from the workdir to the upper layer.
This patch implements cleaning out any files left in workdir. It is
implemented using real recursion for simplicity, but the depth is limited
to 2, because the worst case is that of a directory containing whiteouts
under "work".
Miklos Szeredi [Mon, 21 Mar 2016 16:31:46 +0000 (17:31 +0100)]
ovl: rename is_merge to is_lowest
The 'is_merge' is an historical naming from when only a single lower layer
could exist. With the introduction of multiple lower layers the meaning of
this flag was changed to mean only the "lowest layer" (while all lower
layers were being merged).
So now 'is_merge' is inaccurate and hence renaming to 'is_lowest'
Andreas Gruenbacher [Mon, 22 Aug 2016 15:22:11 +0000 (17:22 +0200)]
ovl: Switch to generic_removexattr
Commit d837a49bd57f ("ovl: fix POSIX ACL setting") switches from
iop->setxattr from ovl_setxattr to generic_setxattr, so switch from
ovl_removexattr to generic_removexattr as well. As far as permission
checking goes, the same rules should apply in either case.
While doing that, rename ovl_setxattr to ovl_xattr_set to indicate that
this is not an iop->setxattr implementation and remove the unused inode
argument.
Move ovl_other_xattr_set above ovl_own_xattr_set so that they match the
order of handlers in ovl_xattr_handlers.
Andreas Gruenbacher [Mon, 22 Aug 2016 14:36:49 +0000 (16:36 +0200)]
ovl: Get rid of ovl_xattr_noacl_handlers array
Use an ordinary #ifdef to conditionally include the POSIX ACL handlers
in ovl_xattr_handlers, like the other filesystems do. Flag the code
that is now only used conditionally with __maybe_unused.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569
Miklos Szeredi [Mon, 6 Jun 2016 14:21:37 +0000 (16:21 +0200)]
ovl: xattr filter fix
a) ovl_need_xattr_filter() is wrong, we can have multiple lower layers
overlaid, all of which (except the lowest one) honouring the
"trusted.overlay.opaque" xattr. So need to filter everything except the
bottom and the pure-upper layer.
b) we no longer can assume that inode is attached to dentry in
get/setxattr.
This patch unconditionally filters private xattrs to fix both of the above.
Performance impact for get/removexattrs is likely in the noise.
For listxattrs it might be measurable in pathological cases, but I very
much hope nobody cares. If they do, we'll fix it then.
Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: b96809173e94 ("security_d_instantiate(): move to the point prior to attaching dentry to inode")
Orabug: 26401569
The commit 39b681f80(ovl: store real inode pointer in ->i_private)
adds a call to the WRITE_ONCE which generates "initialization makes pointer
from integer without a cast" warnings, fix it by coverting to the required type.
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
The empty checking logic is duplicated in ovl_check_empty_and_clear() and
ovl_remove_and_whiteout(), except the condition for clearing whiteouts is
different:
ovl_check_empty_and_clear() checked for being upper
ovl_remove_and_whiteout() checked for merge OR lower
Move the intersection of those checks (upper AND merge) into
ovl_check_empty_and_clear() and simplify ovl_remove_and_whiteout().
Right now we remove MAY_WRITE/MAY_APPEND bits from mask if realfile is on
lower/. This is done as files on lower will never be written and will be
copied up. But to copy up a file, mounter should have MAY_READ permission
otherwise copy up will fail. So set MAY_READ in mask when MAY_WRITE is
reset.
Dan Walsh noticed this when he did access(lowerfile, W_OK) and it returned
True (context mounts) but when he tried to actually write to file, it
failed as mounter did not have permission on lower file.
[SzM] don't set MAY_READ if only MAY_APPEND is set without MAY_WRITE; this
won't trigger a copy-up.
ovl: dilute permission checks on lower only if not special file
Right now if file is on lower/, we remove MAY_WRITE/MAY_APPEND bits from
mask as lower/ will never be written and file will be copied up. But this
is not true for special files. These files are not copied up and are opened
in place. So don't dilute the checks for these types of files.
1) Some permission checks are done by ->setxattr() which now uses mounter's
creds ("ovl: do operations on underlying file system in mounter's
context"). These permission checks need to be done with current cred as
well.
2) Setting ACL can fail for various reasons. We do not need to copy up in
these cases.
In the mean time switch to using generic_setxattr.
[Arnd Bergmann] Fix link error without POSIX ACL. posix_acl_from_xattr()
doesn't have a 'static inline' implementation when CONFIG_FS_POSIX_ACL is
disabled, and I could not come up with an obvious way to do it.
This instead avoids the link error by defining two sets of ACL operations
and letting the compiler drop one of the two at compile time depending
on CONFIG_FS_POSIX_ACL. This avoids all references to the ACL code,
also leading to smaller code.
Inode attributes are copied up to overlay inode (uid, gid, mode, atime,
mtime, ctime) so generic code using these fields works correcty. If a hard
link is created in overlayfs separate inodes are allocated for each link.
If chmod/chown/etc. is performed on one of the links then the inode
belonging to the other ones won't be updated.
This patch attempts to fix this by sharing inodes for hard links.
Use inode hash (with real inode pointer as a key) to make sure overlay
inodes are shared for hard links on upper. Hard links on lower are still
split (which is not user observable until the copy-up happens, see
Documentation/filesystems/overlayfs.txt under "Non-standard behavior").
The inode is only inserted in the hash if it is non-directoy and upper.
To get from overlay inode to real inode we currently use 'struct
ovl_entry', which has lifetime connected to overlay dentry. This is okay,
since each overlay dentry had a new overlay inode allocated.
Following patch will break that assumption, so need to leave out ovl_entry.
This patch stores the real inode directly in i_private, with the lowest bit
used to indicate whether the inode is upper or lower.
Lifetime rules remain, using ovl_inode_real() must only be done while
caller holds ref on overlay dentry (and hence on real dentry), or within
RCU protected regions.
This patch adds an i_op->update_time() handler to overlayfs inodes. This
forwards atime updates to the upper layer only. No atime updates are done
on lower layers.
Remove implicit atime updates to underlying files and directories with
O_NOATIME. Remove explicit atime update in ovl_readlink().
Clear atime related mnt flags from cloned upper mount. This means atime
updates are controlled purely by overlayfs mount options.
Reported-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569
This patch is a fix to the conflict of the upstream commit bb0d2b8ad29
(ovl: fix sgid on directory), convert the lock type to match with the
lock usage of the current kernel.
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
When creating directory in workdir, the group/sgid inheritance from the
parent dir was omitted completely. Fix this by calling inode_init_owner()
on overlay inode and using the resulting uid/gid/mode to create the file.
Unfortunately the sgid bit can be stripped off due to umask, so need to
reset the mode in this case in workdir before moving the directory in
place.
The fact that we always do permission checking on the overlay inode and
clear MAY_WRITE for checking access to the lower inode allows cruft to be
removed from ovl_permission().
1) "default_permissions" option effectively did generic_permission() on the
overlay inode with i_mode, i_uid and i_gid updated from underlying
filesystem. This is what we do by default now. It did the update using
vfs_getattr() but that's only needed if the underlying filesystem can
change (which is not allowed). We may later introduce a "paranoia_mode"
that verifies that mode/uid/gid are not changed.
2) splitting out the IS_RDONLY() check from inode_permission() also becomes
unnecessary once we remove the MAY_WRITE from the lower inode check.
ovl: do not require mounter to have MAY_WRITE on lower
Now we have two levels of checks in ovl_permission(). overlay inode
is checked with the creds of task while underlying inode is checked
with the creds of mounter.
Looks like mounter does not have to have WRITE access to files on lower/.
So remove the MAY_WRITE from access mask for checks on underlying
lower inode.
This means task should still have the MAY_WRITE permission on lower
inode and mounter is not required to have MAY_WRITE.
It also solves the problem of read only NFS mounts being used as lower.
If __inode_permission(lower_inode, MAY_WRITE) is called on read only
NFS, it fails. By resetting MAY_WRITE, check succeeds and case of
read only NFS shold work with overlay without having to specify any
special mount options (default permission).
ovl: do operations on underlying file system in mounter's context
Given we are now doing checks both on overlay inode as well underlying
inode, we should be able to do checks and operations on underlying file
system using mounter's context.
So modify all operations to do checks/operations on underlying dentry/inode
in the context of mounter.
Miklos Szeredi [Wed, 15 Jun 2016 12:18:59 +0000 (14:18 +0200)]
ovl: fix uid/gid when creating over whiteout
Fix a regression when creating a file over a whiteout. The new
file/directory needs to use the current fsuid/fsgid, not the ones from the
mounter's credentials.
The refcounting is a bit tricky: prepare_creds() sets an original refcount,
override_creds() gets one more, which revert_cred() drops. So
1) we need to expicitly put the mounter's credentials when overriding
with the updated one
2) we need to put the original ref to the updated creds (and this can
safely be done before revert_creds(), since we'll still have the ref
from override_creds()).
Reported-by: Stephen Smalley <sds@tycho.nsa.gov> Fixes: 3fe6e52f0626 ("ovl: override creds with the ones from the superblock mounter") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 26401569
ovl: modify ovl_permission() to do checks on two inodes
Right now ovl_permission() calls __inode_permission(realinode), to do
permission checks on real inode and no checks are done on overlay inode.
Modify it to do checks both on overlay inode as well as underlying inode.
Checks on overlay inode will be done with the creds of calling task while
checks on underlying inode will be done with the creds of mounter.
Now we are planning to do DAC permission checks on overlay inode
itself. And to make it work, we will need to make sure we can get acls from
underlying inode. So define ->get_acl() for overlay inodes and this in turn
calls into underlying filesystem to get acls, if any.
Vivek Goyal [Thu, 16 Jun 2016 14:09:14 +0000 (10:09 -0400)]
ovl: move some common code in a function
ovl_create_upper() and ovl_create_over_whiteout() seem to be sharing some
common code which can be moved into a separate function. No functionality
change.
The hash salting changes meant that we can no longer reuse the hash in the
overlay dentry to look up the underlying dentry.
Instead of lookup_hash(), use lookup_one_len_unlocked() and swith to
mounter's creds (like we do for all other operations later in the series).
Now the lookup_hash() export introduced in 4.6 by 3c9fe8cdff1b ("vfs: add
lookup_hash() helper") is unused and can possibly be removed; its
usefulness negated by the hash salting and the idea that mounter's creds
should be used on operations on underlying filesystems.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 8387ff2577eb ("vfs: make the string hashes salt the hash")
Orabug: 26401569
Miklos Szeredi [Tue, 10 May 2016 23:16:37 +0000 (01:16 +0200)]
ovl: ignore permissions on underlying lookup
Generally permission checking is not necessary when overlayfs looks up a
dentry on one of the underlying layers, since search permission on base
directory was already checked in ovl_permission().
More specifically using lookup_one_len() causes a problem when the lower
directory lacks search permission for a specific user while the upper
directory does have search permission. Since lookups are cached, this
causes inconsistency in behavior: success depends on who did the first
lookup.
So instead use lookup_hash() which doesn't do the permission check.
Interestingly, if you don't place files in merged/dir you can remove it,
meaning if upper/dir does not exist, creating the char device file works
properly in that same location.
This patch uses ovl_sb_creator_cred() to get the cred struct from the
superblock mounter and override the old cred with these new ones so that
the whiteout creation is possible because overlay is wrong in assuming that
the creds it will get with prepare_creds will be in the initial user
namespace. The old cap_raise game is removed in favor of just overriding
the old cred struct.
This patch also drops from ovl_copy_up_one() the following two lines:
Miklos Szeredi [Wed, 29 Jun 2016 06:26:59 +0000 (08:26 +0200)]
ovl: fix dentry leak for default_permissions
When using the 'default_permissions' mount option, ovl_permission() on
non-directories was missing a dput(alias), resulting in "BUG Dentry still
in use".
Miklos Szeredi [Mon, 12 Oct 2015 13:56:20 +0000 (15:56 +0200)]
ovl: fix open in stacked overlay
If two overlayfs filesystems are stacked on top of each other, then we need
recursion in ovl_d_select_inode().
I guess d_backing_inode() is supposed to do that. But currently it doesn't
and that functionality is open coded in vfs_open(). This is now copied
into ovl_d_select_inode() to fix this regression.
Reported-by: Alban Crequy <alban.crequy@gmail.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay...") Cc: David Howells <dhowells@redhat.com> Cc: <stable@vger.kernel.org> # v4.2+
Orabug: 26401569
NeilBrown [Thu, 7 Jan 2016 21:08:20 +0000 (16:08 -0500)]
nfsd: don't hold i_mutex over userspace upcalls
We need information about exports when crossing mountpoints during
lookup or NFSv4 readdir. If we don't already have that information
cached, we may have to ask (and wait for) rpc.mountd.
In both cases we currently hold the i_mutex on the parent of the
directory we're asking rpc.mountd about. We've seen situations where
rpc.mountd performs some operation on that directory that tries to take
the i_mutex again, resulting in deadlock.
With some care, we may be able to avoid that in rpc.mountd. But it
seems better just to avoid holding a mutex while waiting on userspace.
It appears that lookup_one_len is pretty much the only operation that
needs the i_mutex. So we could just drop the i_mutex elsewhere and do
something like
mutex_lock()
lookup_one_len()
mutex_unlock()
In many cases though the lookup would have been cached and not required
the i_mutex, so it's more efficient to create a lookup_one_len() variant
that only takes the i_mutex when necessary.
Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 26401569
Another deadlock path caused by recursive locking is reported. This
kind of issue was introduced since commit 743b5f1434f5 ("ocfs2: take
inode lock in ocfs2_iop_set/get_acl()"). Two deadlock paths have been
fixed by commit b891fa5024a9 ("ocfs2: fix deadlock issue when taking
inode lock at vfs entry points"). Yes, we intend to fix this kind of
case in incremental way, because it's hard to find out all possible
paths at once.
This one can be reproduced like this. On node1, cp a large file from
home directory to ocfs2 mountpoint. While on node2, run
setfacl/getfacl. Both nodes will hang up there. The backtraces:
Fix this one by using ocfs2_inode_{lock|unlock}_tracker, which is
exported by commit 439a36b8ef38 ("ocfs2/dlmglue: prepare tracking logic
to avoid recursive cluster lock").
Link: http://lkml.kernel.org/r/20170622014746.5815-1-zren@suse.com Fixes: 743b5f1434f5 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()") Signed-off-by: Eric Ren <zren@suse.com> Reported-by: Thomas Voegtle <tv@lio96.de> Tested-by: Thomas Voegtle <tv@lio96.de> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherrypicked from commit 8818efaaacb78c60a9d90c5705b6c99b75d7d442) Signed-off-by: Ashish Samant <ashish.samant@oracle.com>