There are several race conditions while freezing a queue.
When unfreezing a queue, there is a small window between decrementing
q->mq_freeze_depth to zero and the percpu_ref_reinit() call on
q->mq_usage_counter. If another task calls blk_mq_freeze_queue_start()
in that window, q->mq_freeze_depth is increased from zero to one and
percpu_ref_kill() is called on q->mq_usage_counter, which is already
killed. The percpu refcount must be re-initialized before being killed again.
Also, there is a race condition while switching to percpu mode.
percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
executed at the same time as the following scenario is possible:
1. q->mq_usage_counter is initialized in atomic mode.
(atomic counter: 1)
2. After the disk registration, a process like systemd-udev starts
accessing the disk, and successfully increases the refcount
by percpu_ref_tryget_live() in blk_mq_queue_enter().
(atomic counter: 2)
3. In the final stage of initialization, q->mq_usage_counter is being
switched to percpu mode by percpu_ref_switch_to_percpu() in
blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
the process is rescheduled in the middle of switching when calling
wait_event() in __percpu_ref_switch_to_percpu().
(atomic counter: 2)
4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
request queue. q->mq_usage_counter is decreased and marked as
DEAD. Wait until all requests have finished.
(atomic counter: 1)
5. The process rescheduled in step 3 is resumed and finishes
all remaining work in __percpu_ref_switch_to_percpu().
A bias value is added to atomic counter of q->mq_usage_counter.
(atomic counter: PERCPU_COUNT_BIAS + 1)
6. A request issued in step 2 is finished and q->mq_usage_counter
is decreased by blk_mq_queue_exit(). q->mq_usage_counter is DEAD,
so atomic counter is decreased and no release handler is called.
(atomic counter: PERCPU_COUNT_BIAS)
7. CPU hotplug handling in step 4 will wait forever as
q->mq_usage_counter will never be zero.
Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
at the same time, because both functions could call
__percpu_ref_switch_to_percpu(), which adds the bias value and
initializes the percpu counter.
Fix these races by serializing with a per-queue mutex.
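A minimal sketch of that serialization, assuming a mutex named
q->mq_freeze_lock (the name and exact shape are illustrative, not the
verbatim patch):

	void blk_mq_freeze_queue_start(struct request_queue *q)
	{
		mutex_lock(&q->mq_freeze_lock);
		/* kill the percpu ref only on the 0 -> 1 transition */
		if (++q->mq_freeze_depth == 1)
			percpu_ref_kill(&q->mq_usage_counter);
		mutex_unlock(&q->mq_freeze_lock);
	}

	void blk_mq_unfreeze_queue(struct request_queue *q)
	{
		mutex_lock(&q->mq_freeze_lock);
		/* reinit only on the 1 -> 0 transition; holding the mutex
		 * means a concurrent freeze can no longer kill a ref that
		 * has not been re-initialized yet */
		if (--q->mq_freeze_depth == 0) {
			percpu_ref_reinit(&q->mq_usage_counter);
			wake_up_all(&q->mq_freeze_wq);	/* wait queue name assumed */
		}
		mutex_unlock(&q->mq_freeze_lock);
	}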
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Ming Lei <tom.leiming@gmail.com>
(cherry picked from https://patchwork.kernel.org/patch/7269471/)
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
The dummy ruleset I used to test the original validation change was broken;
most rules were unreachable and were not tested by mark_source_chains().
In some cases rulesets that used to load in a few seconds now require
several minutes.
sample ruleset that shows the behaviour:
echo "*filter"
for i in $(seq 0 100000);do
printf ":chain_%06x - [0:0]\n" $i
done
for i in $(seq 0 100000);do
printf -- "-A INPUT -j chain_%06x\n" $i
printf -- "-A INPUT -j chain_%06x\n" $i
printf -- "-A INPUT -j chain_%06x\n" $i
done
echo COMMIT
[ pipe result into iptables-restore ]
This ruleset will be about 74mbyte in size, with ~500k searches
through all 500k[1] rule entries. iptables-restore will take forever
(gave up after 10 minutes)
Instead of always searching the entire blob for a match, fill an
array with the start offsets of every single ipt_entry struct,
then do a binary search to check if the jump target is present or not.
After this change, ruleset restore times are again close to what one
gets when reverting 36472341017529e (~3 seconds on my workstation).
[1] every user-defined rule gets an implicit RETURN, so we get
300k jumps + 100k userchains + 100k returns -> 500k rule entries
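Since the offsets array is filled while walking the blob front to back,
it is already sorted and a plain binary search suffices. A sketch of the
lookup (illustrative, not the exact kernel helper):

	/* return true if 'target' is the start offset of a real ipt_entry;
	 * offsets[] holds the start of every entry, in blob order */
	static bool jump_target_valid(const unsigned int *offsets,
				      unsigned int count, unsigned int target)
	{
		unsigned int lo = 0, hi = count;

		while (lo < hi) {
			unsigned int mid = lo + (hi - lo) / 2;

			if (offsets[mid] == target)
				return true;
			if (offsets[mid] < target)
				lo = mid + 1;
			else
				hi = mid;
		}
		return false;
	}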
Fixes: 36472341017529e ("netfilter: x_tables: validate targets of jumps") Reported-by: Jeff Wu <wujiafu@gmail.com> Tested-by: Jeff Wu <wujiafu@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 2686f12b26e217befd88357cf84e78d0ab3c86a1) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Make sure the table names returned via getsockopt GET_ENTRIES are nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.
Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit b301f2538759933cf9ff1f7c4f968da72e3f0757) Signed-off-by: Brian Maly <brian.maly@oracle.com>
This looks like refactoring, but it's also a bug fix.
Problem is that the compat path (32bit iptables, 64bit kernel) lacks a few
sanity tests that are done in the normal path.
For example, we do not check for underflows and the base chain policies.
While it's possible to also add such checks to the compat path, it's more
copy&paste; for instance, we cannot reuse the check_underflow() helper as
e->target_offset differs in the compat case.
Another problem is that it makes auditing for validation errors harder; two
places need to be checked and kept in sync.
At a high level 32 bit compat works like this:
1- initial pass over blob:
validate match/entry offsets, bounds checking
lookup all matches and targets
do bookkeeping wrt. size delta of 32/64bit structures
assign match/target.u.kernel pointer (points at kernel
implementation, needed to access ->compatsize etc.)
2- allocate memory according to the total bookkeeping size to
contain the translated ruleset
3- second pass over original blob:
for each entry, copy the 32bit representation to the newly allocated
memory. This also does any special match translations (e.g.
adjust 32bit to 64bit longs, etc).
4- check if ruleset is free of loops (chase all jumps)
5- first pass over translated blob:
call the checkentry function of all matches and targets.
The alternative implemented by this patch is to drop steps 3&4 from the
compat process; the translation is changed into an intermediate step
rather than a full 1:1 translate_table replacement.
In the 2nd pass (step #3), change the 64bit ruleset back to a kernel
representation, i.e. put() the kernel pointer and restore ->u.user.name.
This gets us a 64bit ruleset that is in the format generated by a 64bit
iptables userspace -- we can then use translate_table() to get the
'native' sanity checks.
This has two drawbacks:
1. we re-validate all the match and target entry structure sizes even
though compat translation is supposed to never generate bogus offsets.
2. we put and then re-lookup each match and target.
The upside is that we get all sanity tests and ruleset validations
provided by the normal path and can remove some duplicated compat code.
iptables-restore time of autogenerated ruleset with 300k chains of form
-A CHAIN0001 -m limit --limit 1/s -j CHAIN0002
-A CHAIN0002 -m limit --limit 1/s -j CHAIN0003
shows no noticeable differences in restore times:
old: 0m30.796s
new: 0m31.521s
64bit: 0m25.674s
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit af815d264b7ed1cdceb0e1fdf4fa7d3bd5fe9a99) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Quoting John Stultz:
In updating a 32bit arm device from 4.6 to Linus' current HEAD, I
noticed I was having some trouble with networking, and realized that
/proc/net/ip_tables_names was suddenly empty.
Digging through the registration process, it seems we're catching on the:
Validate that all matches (if any) add up to the beginning of
the target and that each match covers at least the base structure size.
The compat path should be able to safely re-use the function
as the structures only differ in alignment; added a
BUILD_BUG_ON just in case we have an arch that adds padding as well.
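The guard could look roughly like this (struct names assumed; treat as a
sketch of the idea):

	/* if an arch ever pads the compat variant differently, refuse
	 * to build instead of silently miscomputing offsets */
	BUILD_BUG_ON(sizeof(struct compat_xt_entry_match) !=
		     sizeof(struct xt_entry_match));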
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit a605e7476c66a13312189026e6977bad6ed3050d) Signed-off-by: Brian Maly <brian.maly@oracle.com>
We're currently asserting that targetoff + targetsize <= nextoff.
Extend it to also check that targetoff is >= sizeof(xt_entry).
Since this is generic code, add an argument pointing to the start of the
match/target, we can then derive the base structure size from the delta.
We also need the e->elems pointer in a followup change to validate matches.
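Roughly, the extended check becomes something like this (a sketch; the
base structure size is derived from the delta between the rule start and
its first match/target):

	/* 'entry' is the rule start, 'elems' its first match/target */
	unsigned int base = (const char *)elems - (const char *)entry;

	if (target_offset < base)			/* new lower bound */
		return -EINVAL;
	if (target_offset + target_size > next_offset)	/* existing check */
		return -EINVAL;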
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 451e4403bc4abc51539376d4314baa739ab9e996) Signed-off-by: Brian Maly <brian.maly@oracle.com>
We have targets and standard targets -- the latter carries a verdict.
The ip/ip6tables validation functions will access t->verdict for the
standard targets to fetch the jump offset or verdict for chainloop
detection, but this happens before the targets get checked/validated.
Thus we also need to check for verdict presence here, else t->verdict
can point right after a blob.
Spotted with UBSAN while testing malformed blobs.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 73bfda1c492bef7038a87adfa887b7e6b7cd6679) Signed-off-by: Brian Maly <brian.maly@oracle.com>
32bit rulesets have different layout and alignment requirements, so once
more integrity checks get added to xt_check_entry_offsets it will reject
well-formed 32bit rulesets.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit acbcf85306bd563910a2afe07f07d30381b031b0) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Once we add more sanity testing to xt_check_entry_offsets it
becomes relevant whether we're expecting a 32bit 'config_compat' blob
or a normal one.
Since we already have a lot of similar-named functions (check_entry,
compat_check_entry, find_and_check_entry, etc.) and the current
incarnation is short, just fold its contents into the callers.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 801cd32774d12dccfcfc0c22b0b26d84ed995c6f) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Currently arp/ip and ip6tables each implement a short helper to check that
the target offset is large enough to hold one xt_entry_target struct and
that t->u.target_size fits within the current rule.
Unfortunately these checks are not sufficient.
To avoid adding new tests to all of ip/ip6/arptables move the current
checks into a helper, then extend this helper in followup patches.
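A sketch of the consolidated helper described above (the signature is an
assumption; the two checks mirror the ones being moved):

	static int xt_check_entry_offsets(const void *base,
					  unsigned int target_offset,
					  unsigned int next_offset)
	{
		const struct xt_entry_target *t;

		/* room for one xt_entry_target struct? */
		if (next_offset < target_offset + sizeof(*t))
			return -EINVAL;

		/* does t->u.target_size fit within the rule? */
		t = (const void *)((const char *)base + target_offset);
		if (next_offset < target_offset + t->u.target_size)
			return -EINVAL;

		return 0;
	}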
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit a471ac817cf0e0d6e87779ca1fee216ba849e613) Signed-off-by: Brian Maly <brian.maly@oracle.com>
In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.
Problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so its supposed to return
an absolute verdict of either ACCEPT or DROP.
However, the function conditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.
But an unconditional rule must also not be using any matches
(no -m args).
The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for presence of matches, and thus
proceeded to the next (non-existent) rule.
Unify this so that all the callers have the same idea of an 'unconditional rule'.
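A sketch of the unified notion (names assumed; the point is testing both
the wildcard addresses and the absence of matches):

	/* a rule is unconditional only if it carries no matches, i.e.
	 * target_offset == sizeof(struct ipt_entry), AND matches any
	 * address (its ipt_ip part is all zeroes) */
	static bool unconditional(const struct ipt_entry *e)
	{
		static const struct ipt_ip uncond;

		return e->target_offset == sizeof(struct ipt_entry) &&
		       memcmp(&e->ip, &uncond, sizeof(uncond)) == 0;
	}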
Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 850c377e0e2d76723884d610ff40827d26aa21eb) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/ipv4/netfilter/arp_tables.c
In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.
Base chains enforce an absolute verdict.
User-defined chains are supposed to end with an unconditional return;
xtables userspace adds them automatically.
But if such a return is missing, we will move to a non-existent next rule.
Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit cf756388f8f34e02a338356b3685c46938139871) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ben Hawkes says:
integer overflow in xt_alloc_table_info, which on 32-bit systems can
lead to small structure allocation and a copy_from_user based heap
corruption.
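The wrap can be guarded roughly like this (a sketch of the idea, not
necessarily the exact fix):

	struct xt_table_info *info;
	/* 'size' comes from userspace; on 32-bit systems the addition
	 * below can wrap and yield a small allocation */
	size_t sz = sizeof(*info) + size;

	if (sz < sizeof(*info) || sz < size)
		return NULL;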
Reported-by: Ben Hawkes <hawkes@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Otherwise this function may read data beyond the ruleset blob.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 6e94e0cfb0887e4013b3b930fa6ab1fe6bb6ba91) Signed-off-by: Brian Maly <brian.maly@oracle.com>
The root cause is that CPU_UP_PREPARE is completely the wrong notifier
action from which to access cpu_data(), because smp_store_cpu_info()
won't have been executed by the target CPU at that point, which in turn
means that ->x86_cache_max_rmid and ->x86_cache_occ_scale haven't been
filled out.
Instead let's invoke our handler from CPU_STARTING and rename it
appropriately.
Reported-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ashok Raj <ashok.raj@intel.com> Cc: Kanaka Juvva <kanaka.d.juvva@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikas Shivappa <vikas.shivappa@intel.com> Link: http://lkml.kernel.org/r/1438863163-14083-1-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 24745516
(cherry picked from commit d7a702f0b1033cf402fef65bd6395072738f0844) Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
Michal Hocko [Tue, 2 Aug 2016 21:02:34 +0000 (14:02 -0700)]
mm, hugetlb: fix huge_pte_alloc BUG_ON
Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he
runs his database load with memory online and offline running in
parallel. The reason is that huge_pmd_share might detect a shared pmd
which is currently being migrated and so has a migration pte which is
!pte_huge.
There doesn't seem to be any easy way to prevent the race, and in
fact seeing the migration swap entry is not harmful. Both callers of
huge_pte_alloc are prepared to handle them. copy_hugetlb_page_range
will copy the swap entry and make it COW if needed. hugetlb_fault will
back off, so the page fault is retried if the page is still under
migration, waiting for the migration to complete in hugetlb_fault.
That means that the BUG_ON is wrong and we should update it. Let's
simply check that all present ptes are pte_huge instead.
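The relaxed assertion might look like this (illustrative sketch):

	/* only present ptes must be huge; a migration entry is fine */
	BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));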
Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: zhongjiang <zhongjiang@huawei.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 24691289
(cherry picked from commit 4e666314d286765a9e61818b488c7372326654ec) Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
When an unmapped hw queue is remapped after the CPU topology is changed,
hctx->tags->cpumask has to be set after hctx->tags is set up in
blk_mq_map_swqueue(), otherwise it causes a null pointer dereference.
Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 1356aae08338f1c19ce1c67bf8c543a267688fc3) Signed-off-by: Bob Liu <bob.liu@oracle.com>
Currently q->mq_ops is used widely to decide if the queue
is mq or not, so we should set the 'flag' asap so that both
block core and drivers can get the correct mq info.
For example, commit 868f2f0b720 ("blk-mq: dynamic h/w context count")
moves the hctx's initialization before setting q->mq_ops in
blk_mq_init_allocated_queue(), which causes blk_alloc_flush_queue()
to think the queue is non-mq and not allocate the command size
for the per-hctx flush rq.
This patch should fix the problem reported by Sasha.
Cc: Keith Busch <keith.busch@intel.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Fixes: 868f2f0b720 ("blk-mq: dynamic h/w context count") Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 66841672161efb9e3be4a1dbd9755020bb1d86b7) Signed-off-by: Dan Duval <dan.duval@oracle.com>
In the listxattrs handler, we were not listing all the xattrs that are
packed in the same btree item, which happens when multiple xattrs have
names that, when crc32c hashed, produce the same checksum value.
Fix this by processing them all.
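A sketch of the loop shape (accessor names assumed from the usual btrfs
dir-item layout; illustrative only):

	/* several xattrs may share one btree item when their names hash
	 * to the same crc32c; walk every entry packed in the item */
	di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
	item_size = btrfs_item_size_nr(leaf, slot);
	cur = 0;
	while (cur < item_size) {
		u16 name_len = btrfs_dir_name_len(leaf, di);
		u16 data_len = btrfs_dir_data_len(leaf, di);

		/* ... emit this xattr's name into the listing ... */

		cur += sizeof(*di) + name_len + data_len;
		di = (struct btrfs_dir_item *)((char *)di + sizeof(*di) +
					       name_len + data_len);
	}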
The following test case for xfstests reproduces the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/attr
# real QA test starts here
_supported_fs generic
_supported_os Linux
_require_scratch
_require_attrs
rm -f $seqres.full
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount
# Create our test file with a few xattrs. The first 3 xattrs have a name
# that when given as input to a crc32c function result in the same checksum.
# This made btrfs list only one of the xattrs through listxattrs system call
# (because it packs xattrs with the same name checksum into the same btree
# item).
touch $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
$SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile
# Now call getfattr with --dump, which calls the listxattrs system call.
# It should list all the xattrs we have set before.
$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit daac7ba61a0d338c66b70c47d205ba7465718155) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
When listing an inode's xattrs we have a time window where we race against
a concurrent operation for adding a new hard link for our inode that makes
us not return any xattr to user space. In order for this to happen, the
first xattr of our inode needs to be at slot 0 of a leaf and the previous
leaf must still have room for an inode ref (or extref) item, and this can
happen because an inode's listxattrs callback does not lock the inode's
i_mutex (nor does the VFS do it for us), but adding a hard link to an
inode makes the VFS lock the inode's i_mutex before calling the inode's
link callback.
The race illustrated by the following sequence diagram is possible:
CPU 1:
  btrfs_listxattr()
    searches for key (257 XATTR_ITEM 0)
    gets path with path->nodes[0] == leaf X and path->slots[0] == N
    because path->slots[0] is >= btrfs_header_nritems(leaf X),
    it calls btrfs_next_leaf()
    btrfs_next_leaf() releases the path

CPU 2:
  adds key (257 INODE_REF 666) to the end of leaf X (slot N),
  and leaf X now has N + 1 items

CPU 1:
  btrfs_next_leaf() searches for the key (257 INODE_REF 256),
  with path->keep_locks == 1, because that is the last key it saw
  in leaf X before releasing the path
  it ends up at leaf X again and verifies that the key
  (257 INODE_REF 256) is no longer the last key in leaf X, so it
  returns with path->nodes[0] == leaf X and path->slots[0] == N,
  pointing to the new item with key (257 INODE_REF 666)
  btrfs_listxattr's loop iteration sees that the type of the key
  pointed to by the path is different from BTRFS_XATTR_ITEM_KEY,
  so it breaks the loop and stops looking for more xattr items
  --> the application doesn't get any xattr listed for our inode
So fix this by breaking the loop only if the key's type is greater than
BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
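In code, the adjusted loop condition is roughly (sketch):

	if (found_key.type > BTRFS_XATTR_ITEM_KEY)
		break;		/* past all xattr items, really done */
	if (found_key.type < BTRFS_XATTR_ITEM_KEY)
		goto next_item;	/* e.g. an INODE_REF that slid in; label assumed */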
Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit f1cd1f0b7d1b5d4aaa5711e8f4e4898b0045cb6d) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
If we move a directory to a new parent and later log that parent and don't
explicitly log the old parent, when we replay the log we can end up with
entries for the moved directory in both the old and new parent directories.
Besides it being illegal to have directories with multiple hard links in
Linux, it also resulted in leaving the inode item with a link count of 1.
A similar issue also happens if we move a regular file - after the log tree
is replayed the file has a link in both the old and new parent directories,
when it should be only at the new directory.
Sample reproducer:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/x
$ mkdir /mnt/y
$ touch /mnt/x/foo
$ mkdir /mnt/y/z
$ sync
$ ln /mnt/x/foo /mnt/x/bar
$ mv /mnt/y/z /mnt/x/z
< power fail >
$ mount /dev/sdc /mnt
$ ls -1Ri /mnt
/mnt:
257 x
258 y
/mnt/x:
259 bar
259 foo
260 z
/mnt/x/z:
/mnt/y:
260 z
/mnt/y/z:
$ umount /dev/sdc
$ btrfs check /dev/sdc
Checking filesystem on /dev/sdc
UUID: a67e2c4a-a4b4-4fdc-b015-9d9af1e344be
checking extents
checking free space cache
checking fs roots
root 5 inode 260 errors 2000, link count wrong
unresolved ref dir 257 index 4 namelen 1 name z filetype 2 errors 0
unresolved ref dir 258 index 2 namelen 1 name z filetype 2 errors 0
(...)
Attempting to remove the directory becomes impossible:
$ mount /dev/sdc /mnt
$ rmdir /mnt/y/z
$ ls -lh /mnt/y
ls: cannot access /mnt/y/z: No such file or directory
total 0
d????????? ? ? ? ? ? z
$ rmdir /mnt/x/z
rmdir: failed to remove ‘/mnt/x/z’: Stale file handle
$ ls -lh /mnt/x
ls: cannot access /mnt/x/z: Stale file handle
total 0
-rw-r--r-- 2 root root 0 Apr 6 18:06 bar
-rw-r--r-- 2 root root 0 Apr 6 18:06 foo
d????????? ? ? ? ? ? z
So make sure that on rename we set the last_unlink_trans value for our
inode, even if it's a directory, to the ID of the current transaction,
and that if the new parent directory is logged we fall back to a
transaction commit.
A test case for fstests is being submitted as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 657ed1aa4898c8304500e0d13f240d5a67e8be5f) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
We have two cases where we end up deleting a file at log replay time
when we should not. For this to happen the file must have been renamed
and a directory inode must have been fsynced/logged.
Two examples that exercise these two cases are listed below.
Case 1)
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkdir -p /mnt/a/b
$ mkdir /mnt/c
$ touch /mnt/a/b/foo
$ sync
$ mv /mnt/a/b/foo /mnt/c/
# Create file bar just to make sure the fsync on directory a/ does
# something and it's not a no-op.
$ touch /mnt/a/bar
$ xfs_io -c "fsync" /mnt/a
< power fail / crash >
The next time the filesystem is mounted, the log replay procedure
deletes file foo.
The next time the filesystem is mounted, the log replay procedure
deletes file bar.
The reason why the files are deleted is because when we log inodes
other than the fsync target inode, we ignore their last_unlink_trans
value and leave the log without enough information to later replay the
rename operations. So we need to look at the last_unlink_trans values
and fall back to a transaction commit if they are greater than the
id of the last committed transaction.
So fix this by looking at the last_unlink_trans values and falling back to
transaction commits when needed. Also, when logging other inodes (for
case 1 we logged descendants of the fsync target inode while for case 2
we logged ascendants) we need to care about concurrent tasks updating
the last_unlink_trans of inodes we are logging (which was already an
existing problem in check_parent_dirs_for_sync()). Since we cannot
acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes
deadlocks with other concurrent operations that acquire the i_mutex of
2 inodes (other fsyncs or renames for example), we need to serialize on
the log_mutex of the inode we are logging. A task setting a new value for
an inode's last_unlink_trans must acquire the inode's log_mutex and it
must do this update before doing the actual unlink operation (which is
already the case except when deleting a snapshot). Conversely the task
logging the inode must first log the inode and then check the inode's
last_unlink_trans value while holding its log_mutex, as if its value is
not greater than the id of the last committed transaction it means it
logged a safe state of the inode's items, while if its value is not
smaller than the id of the last committed transaction it means the inode
state it has logged might not be safe (the concurrent task might have
just updated last_unlink_trans but hasn't done yet the unlink operation)
and therefore a transaction commit must be done.
Test cases for xfstests follow in separate patches.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 2be63d5ce929603d4e7cedabd9e992eb34a0ff95) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
We have one more case where after a log tree is replayed we get
inconsistent metadata leading to stale directory entries, due to
some directories having entries pointing to some inode while the
inode does not have a matching BTRFS_INODE_[REF|EXTREF]_KEY item.
To trigger the problem we need to have a file with multiple hard links
belonging to different parent directories. Then if one of those hard
links is removed and we fsync the file using one of its other links
that belongs to a different parent directory, we end up not logging
the fact that the removed hard link doesn't exist anymore in the
parent directory.
Simple reproducer:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
# Create our test directory and file.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/foo
ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo2
ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo3
# Make sure everything done so far is durably persisted.
sync
# Now we remove one of our file's hardlinks in the directory testdir.
unlink $SCRATCH_MNT/testdir/foo3
# We now fsync our file using the "foo" link, which has a parent that
# is not the directory "testdir".
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
# Silently drop all writes and unmount to simulate a crash/power
# failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again, mount to trigger journal/log replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# After the journal/log is replayed we expect to not see the "foo3"
# link anymore and we should be able to remove all names in the
# directory "testdir" and then remove it (no stale directory entries
# left after the journal/log replay).
echo "Entries in testdir:"
ls -1 $SCRATCH_MNT/testdir
generic/107 3s ... - output mismatch (see .../results/generic/107.out.bad)
--- tests/generic/107.out 2015-08-01 01:39:45.807462161 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/107.out.bad
@@ -1,3 +1,5 @@
QA output created by 107
Entries in testdir:
foo2
+foo3
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
...
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent \
(see /home/fdmanana/git/hub/xfstests/results//generic/107.full)
_check_dmesg: something found in dmesg (see .../results/generic/107.dmesg)
Ran: generic/107
Failures: generic/107
Failed 1 of 1 tests
$ cat /home/fdmanana/git/hub/xfstests/results//generic/107.full
(...)
checking fs roots
root 5 inode 257 errors 200, dir isize wrong
unresolved ref dir 257 index 3 namelen 4 name foo3 filetype 1 errors 5, no dir item, no inode ref
(...)
So fix this by logging all parent inodes, current and old ones, to make
sure we do not get stale entries after log replay. This is not a simple
solution such as triggering a full transaction commit because it would
imply full transaction commit when an inode is fsynced in the same
transaction that modified it and reloaded it after eviction (because its
last_unlink_trans is set to the same value as its last_trans as of the
commit with the title "Btrfs: fix stale dir entries after unlink, inode
eviction and fsync"), and it would also make fstest generic/066 fail
since one of the fsyncs triggers a full commit and the next fsync will
not find the inode in the log anymore (therefore not removing the xattr).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 18aa09229741364280d0a1670597b5207fc05b8d) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
If we delete a snapshot, fsync its parent directory and crash/power fail
before the next transaction commit, on the next mount when we attempt to
replay the log tree of the root containing the parent directory we will
fail and prevent the filesystem from mounting, which is solvable by wiping
out the log trees with the btrfs-zero-log tool but very inconvenient as
we will lose any data and metadata fsynced before the parent directory
was fsynced.
For example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ btrfs subvolume snapshot /mnt /mnt/testdir/snap
$ btrfs subvolume delete /mnt/testdir/snap
$ xfs_io -c "fsync" /mnt/testdir
< crash / power failure and reboot >
$ mount /dev/sdc /mnt
mount: mount(2) failed: No such file or directory
And in dmesg/syslog we get the following message and trace:
This happens because when we are replaying the log and processing the
directory entry pointing to the snapshot in the subvolume tree, we treat
its btrfs_dir_item item as having a location with a key type matching
BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
object id refers to a root number and not to an inode in the root
containing the parent directory.
So fix this by triggering a transaction commit if an fsync against the
parent directory is requested after deleting a snapshot. This is the
simplest approach for a rare use case. Some alternative that avoids the
transaction commit would require more code to explicitly delete the
snapshot at log replay time (factoring out common code from ioctl.c:
btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
log tree of the snapshot's root from the log root of the root of tree
roots, amongst other steps.
A test case for xfstests that triggers the issue follows.
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
cd /
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_dm_target flakey
_require_metadata_journaling $SCRATCH_DEV
# Create a snapshot at the root of our filesystem (mount point path), delete it,
# fsync the mount point path, crash and mount to replay the log. This should
# succeed and after the filesystem is mounted the snapshot should not be visible
# anymore.
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT
_flakey_drop_and_remount
[ -e $SCRATCH_MNT/snap1 ] && \
echo "Snapshot snap1 still exists after log replay"
# Similar scenario as above, but this time the snapshot is created inside a
# directory and not directly under the root (mount point path).
mkdir $SCRATCH_MNT/testdir
_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
_run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
_flakey_drop_and_remount
[ -e $SCRATCH_MNT/testdir/snap2 ] && \
echo "Snapshot snap2 still exists after log replay"
_unmount_flakey
echo "Silence is golden"
status=0
exit
Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 1ec9a1ae1e30c733077c0b288c4301b66b7a81f2) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
If we create a symlink, fsync its parent directory, crash/power fail and
mount the filesystem, we end up with an empty symlink, which is not only
useless but also not allowed in Linux (the symlink(2) man page is
explicit about that). So we just need to make sure to fully log an inode
if it's a symlink, to ensure its inline extent gets logged, ensuring the
same behaviour as ext3, ext4, xfs, reiserfs, f2fs, nilfs2, etc.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 3f9749f6e9edcf8ec569fb542efc3be35e06e84a) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
If __key_link_begin() failed then "edit" would be uninitialized. I've
added a check to fix that.
This allows a random user to crash the kernel, though it's quite
difficult to achieve. There are three ways it can be done as the user
would have to cause an error to occur in __key_link():
(1) Cause the kernel to run out of memory. In practice, this is difficult
to achieve without ENOMEM cropping up elsewhere and aborting the
attempt.
(2) Revoke the destination keyring between the keyring ID being looked up
and it being tested for revocation. In practice, this is difficult to
time correctly because the KEYCTL_REJECT function can only be used
from the request-key upcall process. Further, users can only make use
of what's in /sbin/request-key.conf, though this does include a
rejection debugging test - which means that the destination keyring
has to be the caller's session keyring in practice.
(3) Have just enough key quota available to create a key, a new session
keyring for the upcall and a link in the session keyring, but not then
sufficient quota to create a link in the nominated destination keyring
so that it fails with EDQUOT.
The bug can be triggered using option (3) above using something like the
following:
echo 80 >/proc/sys/kernel/keys/root_maxbytes
keyctl request2 user debug:fred negate @t
The above sets the quota to something much lower (80) to make the bug
easier to trigger, but this is dependent on the system. Note also that
the name of the keyring created contains a random number that may be
between 1 and 10 characters in size, so may throw the test off by
changing the amount of quota used.
Assuming the failure occurs, something like the following will be seen:
Fixes: f70e2e06196a ('KEYS: Do preallocation for __key_link()') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com>
cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 38327424b40bcebe2de92d07312c89360ac9229a) Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Commit 2d8747c28478f85d9f04292780b1432edd2a384e ("blk-mq: avoid inserting
requests before establishing new mapping") introduced an extraneous call
to mutex_unlock(). The result is that the mutex gets unlocked twice.
Eric Dumazet [Sun, 10 Jul 2016 08:04:02 +0000 (10:04 +0200)]
tcp: make challenge acks less predictable
Yue Cao claims that current host rate limiting of challenge ACKS
(RFC 5961) could leak enough information to allow a patient attacker
to hijack TCP sessions. He will soon provide details in an academic
paper.
This patch increases the default limit from 100 to 1000, and adds
some randomization so that the attacker can no longer hijack
sessions without spending a considerable amount of probes.
Based on initial analysis and patch from Linus.
Note that we also have per socket rate limiting, so it is tempting
to remove the host limit in the future.
v2: randomize the count of challenge acks per second, not the period.
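A sketch of the randomization (names follow the 4.x tcp_input.c code;
treat as illustrative):

	/* refresh the host-wide budget once per second, picking a random
	 * count in roughly [limit/2, 1.5 * limit] instead of a fixed one */
	u32 ack_limit = sysctl_tcp_challenge_ack_limit;
	u32 half = (ack_limit + 1) >> 1;

	if (now != challenge_timestamp) {
		challenge_timestamp = now;
		WRITE_ONCE(challenge_count,
			   half + prandom_u32_max(ack_limit));
	}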
Fixes: 282f23c6ee34 ("tcp: implement RFC 5961 3.2") Reported-by: Yue Cao <ycao009@ucr.edu> Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 2401010
Conflicts:
net/ipv4/tcp_input.c Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Miklos Szeredi [Tue, 10 May 2016 23:16:37 +0000 (01:16 +0200)]
vfs: rename: check backing inode being equal
If a file is renamed to a hardlink of itself, POSIX specifies that rename(2)
should do nothing and return success.
This condition is checked in vfs_rename(). However it won't detect hard
links on overlayfs where these are given separate inodes on the overlayfs
layer.
Overlayfs itself detects this condition and returns success without doing
anything, but then vfs_rename() will proceed as if this was a successful
rename (detach_mounts(), d_move()).
The correct thing to do is to detect this condition before even calling
into overlayfs. This patch does this by calling vfs_select_inode() to get
the underlying inodes.
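Conceptually the early bail-out becomes (sketch):

	/* compare the backing inodes, so overlayfs hardlinks with
	 * distinct overlay-level inodes are still detected as equal */
	if (vfs_select_inode(old_dentry, 0) == vfs_select_inode(new_dentry, 0))
		return 0;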
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: <stable@vger.kernel.org> # v4.2+
Orabug: 24363418
CVE: CVE-2016-6198, CVE-2016-6197
Same as mainline v4.6 commit 9409e22acdfc9153f88d9b1ed2bd2a5b34d2d3ca Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Unlink and rename in overlayfs checked the upper dentry for staleness by
verifying upper->d_parent against upperdir. However the dentry can go
stale also by being unhashed, for example.
Expand the verification to actually look up the name again (under parent
lock) and check if it matches the upper dentry. This matches what the VFS
does before passing the dentry to the filesystem's unlink/rename methods, which
excludes any inconsistency caused by overlayfs.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Orabug: 24363418
CVE: CVE-2016-6198, CVE-2016-6197
Based on mainline v4.6 commit 11f3710417d026ea2f4fcf362d866342c5274185
Conflicts:
fs/overlayfs/dir.c - code base Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Rui Wang [Thu, 21 Jul 2016 20:21:39 +0000 (13:21 -0700)]
ovl: fix getcwd() failure after unsuccessful rmdir
ovl_remove_upper() should do d_drop() only after it successfully
removes the dir, otherwise a subsequent getcwd() system call will
fail, breaking userspace programs.
This is to fix: https://bugzilla.kernel.org/show_bug.cgi?id=110491
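The reordering is essentially this (a sketch, assuming the shape of
ovl_remove_upper()):

	err = vfs_rmdir(dir, upper);
	/* ... */
	/* unhash only once the upper dir is really gone, so a failed
	 * rmdir leaves the dentry usable for getcwd() */
	if (!err)
		d_drop(dentry);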
Signed-off-by: Rui Wang <rui.y.wang@intel.com> Reviewed-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org>
Orabug: 24363418
CVE: CVE-2016-6198, CVE-2016-6197
Based on mainline v4.5 commit ce9113bbcbf45a57c082d6603b9a9f342be3ef74
Pre-req for mainline v4.6 commit 11f3710417d026ea2f4fcf362d866342c5274185
Conflicts:
fs/overlayfs/dir.c - code base Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
This commit made rolling upgrades fail. When one node is upgraded
to a new version with this commit, the remaining nodes will fail to
establish connections to it, and then VMs on the remaining nodes can't
be live migrated to that node. This will cause an outage.
Orabug: 24292852 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
When doing a direct IO write, __blockdev_direct_IO() can call the
btrfs_get_blocks_direct() callback one or more times before it calls the
btrfs_submit_direct() callback. However it can fail after calling the
first callback and before calling the second callback, which is a problem
because the first one creates ordered extents and the second one is the
one that submits bios that cover the ordered extents created by the first
one. That means the ordered extents will never complete nor have any of
the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, resulting in
subsequent operations (such as other direct IO writes, buffered writes or
hole punching) that lock the same IO range and look up ordered extents
in the range to hang forever waiting for those ordered extents because
they can not complete ever, since no bio was submitted.
Fix this by tracking a range of created ordered extents that don't have
yet corresponding bios submitted and completing the ordered extents in
the range if __blockdev_direct_IO() fails with an error.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit f28a492878170f39002660a26c329201cf678d74) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Commit 61de718fceb6 ("Btrfs: fix memory corruption on failure to submit
bio for direct IO") fixed problems with the error handling code after we
fail to submit a bio for direct IO. However there were 2 problems that it
did not address when the failure is due to memory allocation failures for
direct IO writes:
1) We considered that there could be only one ordered extent for the whole
IO range, which is not always true, as we can have multiple;
2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent,
which can make other tasks running btrfs_wait_logged_extents() hang
forever, since they wait for that bit to be set. The general assumption
is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set
and it precedes setting the bit BTRFS_ORDERED_COMPLETE.
Fix these issues by moving part of the btrfs_endio_direct_write() handler
into a new helper function and having that new helper function called when
we fail to allocate memory to submit the bio (and its private object) for
a direct IO write.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
(cherry picked from commit 14543774bd67a64f616431e5c9d1472f58979841) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
If we fail to submit a bio for a direct IO request, we were grabbing the
corresponding ordered extent and decrementing its reference count twice,
once for our lookup reference and once for the ordered tree reference.
This was a problem because it caused the ordered extent to be freed
without removing it from the ordered tree and any lists it might be
attached to, leaving dangling pointers to the ordered extent around.
Example trace with CONFIG_DEBUG_PAGEALLOC=y:
For read requests we weren't doing any cleanup either (none of the work
done by btrfs_endio_direct_read()), so a failure submitting a bio for a
read request would leave a range in the inode's io_tree locked forever,
blocking any future operations (both reads and writes) against that range.
So fix this by making sure we do the same cleanup that we do for the case
where the bio submission succeeds.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 61de718fceb6bc028dafe4d06a1f87a9e0998303) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
When doing a write using direct IO we can end up not doing the whole write
operation using the direct IO path, in that case we fallback to a buffered
write to do the remaining IO. This happens for example if the range we are
writing to contains a compressed extent.
When we do a partial write and fallback to buffered IO, due to the
existence of a compressed extent for example, we end up not adjusting the
outstanding extents counter of our inode which ends up getting decremented
twice, once by the DIO ordered extent for the partial write and once again
by btrfs_direct_IO(), resulting in an arithmetic underflow at
extent-tree.c:drop_outstanding_extent(). For example if we have:
extents [ prealloc extent ] [ compressed extent ]
offsets A B C D E
and at the moment our inode's outstanding extents counter is 0, if we do a
direct IO write against the range [B, D[ (which has a length smaller than
128Mb), we end up bumping our inode's outstanding extents counter to 1, we
create a DIO ordered extent for the range [B, C[ and then fallback to a
buffered write for the range [C, D[. The direct IO handler
(inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
1, leaving it with a value of 0, through a call to
btrfs_delalloc_release_space() and then shortly after the DIO ordered
extent finishes and calls btrfs_delalloc_release_metadata() which ends
up to attempt to decrement the inode's outstanding extents counter by 1,
resulting in an assertion failure at drop_outstanding_extent() because
the operation would result in an arithmetic underflow (0 - 1). This
produces the following trace:
So fix this by ensuring we adjust the outstanding extents counter when we
do the fallback just like we do for the case where the whole write can be
done through the direct IO path.
We were also adjusting the outstanding extents counter by a constant value
of 1, which is incorrect because we were ignoring that we account extents
in BTRFS_MAX_EXTENT_SIZE units, so fix that as well.
The following test case for fstests reproduces this issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_xfs_io_command "falloc"
# Create a compressed extent covering the range [700K, 800K[.
$XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Create prealloc extent covering the range [600K, 700K[.
$XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo
# Write 80K of data to the range [640K, 720K[ using direct IO. This
# range covers both the prealloc extent and the compressed extent.
# Because there's a compressed extent in the range we are writing to,
# the DIO write code path ends up only writing the first 60k of data,
# which goes to the prealloc extent, and then falls back to buffered IO
# for writing the remaining 20K of data - because that remaining data
# maps to a file range containing a compressed extent.
# When falling back to buffered IO, we used to trigger an assertion when
# releasing reserved space due to bad accounting of the inode's
# outstanding extents counter, which was set to 1 but we ended up
# decrementing it by 1 twice, once through the ordered extent for the
# 60K of data we wrote using direct IO, and once through the main direct
# IO handler (inode.c:btrfs_direct_IO()) because the direct IO write
# wrote less than 80K of data (60K).
$XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now similar test as above but for very large write operations. This
# triggers special cases for an inode's outstanding extents accounting,
# as internally btrfs logically splits extents into 128Mb units.
$XFS_IO_PROG -f -s \
-c "pwrite -S 0xaa -b 128M 258M 128M" \
-c "falloc 0 258M" \
$SCRATCH_MNT/bar | _filter_xfs_io
$XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
| _filter_xfs_io
# Now verify the file contents are correct and that they are the same
# even after unmounting and mounting the fs again (or evicting the page
# cache).
#
# For file foo, all bytes in the range [0, 640K[ must have a value of
# 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb
# and all bytes in the range [720K, 800K[ must have a value of 0xaa.
#
# For file bar, all bytes in the range [0, 3M[ must have a value of 0x00,
# all bytes in the range [3M, 259M[ must have a value of 0xbb and all
# bytes in the range [259M, 386M[ must have a value of 0xaa.
#
echo "File digests before remounting the file system:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
_scratch_remount
echo "File digests after remounting the file system:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
status=0
exit
Fixes: e1cbbfa5f5aa ("Btrfs: fix outstanding_extents accounting in DIO") Fixes: 3e05bde8c3c2 ("Btrfs: only adjust outstanding_extents when we do a short write") Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 9c9464cc92668984ebed79e22b5063877a8d97db) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
This means that the inode had non-zero "outstanding extents" during
eviction. This occurs because, during direct I/O, a task that successfully
used up its reserved data space would set the BTRFS_INODE_DIO_READY bit and
not clear the bit after finishing the DIO write. A future DIO write could
actually fail and the unused reserve space won't be freed because of the
previously set BTRFS_INODE_DIO_READY bit.
Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
following issue,
|-----------------------------------+-------------------------------------|
| Task A | Task B |
|-----------------------------------+-------------------------------------|
| Start direct i/o write on inode X.| |
| reserve space | |
| Allocate ordered extent | |
| release reserved space | |
| Set BTRFS_INODE_DIO_READY bit. | |
| | splice() |
| | Transfer data from pipe buffer to |
| | destination file. |
| | - kmap(pipe buffer page) |
| | - Start direct i/o write on |
| | inode X. |
| | - reserve space |
| | - dio_refill_pages() |
| | - sdio->blocks_available == 0 |
| | - Since a kernel address is |
| | being passed instead of a |
| | user space address, |
| | iov_iter_get_pages() returns |
| | -EFAULT. |
| | - Since BTRFS_INODE_DIO_READY is |
| | set, we don't release reserved |
| | space. |
| | - Clear BTRFS_INODE_DIO_READY bit.|
| -EIOCBQUEUED is returned. | |
|-----------------------------------+-------------------------------------|
Hence this commit introduces "struct btrfs_dio_data" to track the usage of
reserved data space. The remaining unused "reserve space" can now be freed
reliably.
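The structure is small; a sketch matching the description (field names
as used upstream, treated here as an assumption):

	struct btrfs_dio_data {
		u64 outstanding_extents;
		u64 reserve;	/* reserved data space not yet consumed */
	};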
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 50745b0a7f46f68574cd2b9ae24566bf026e7ebd) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
While running generic/019, dmesg got several warnings from
btrfs_free_reserved_data_space().
Test generic/019 produces some disk failures, so submitted dios will get
errors, in which case btrfs_direct_IO() goes to the error handling and
frees bytes_may_use; but the problem is that bytes_may_use has already
been freed during get_block().
This adds a runtime flag to show whether we've gone through get_block();
if so, don't do the cleanup work.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit ddba1bfc2369cd0566bcfdab47599834a32d1c19) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Commit 21d62f6 ("blk-mq: fix race between timeout and freeing request")
added the element "orig_rq" to struct blk_flush_queue. This broke kABI.
The existing blk_flush_queue structure has "holes" (padding) of 4 bytes
plus 29 bits. This is not enough space to contain orig_rq, which is 8
bytes long.
Currently, this data structure is only used by the block-device code.
It is unlikely that a third-party module (e.g., a device driver) would
use struct blk_flush_queue, and even if one did, it would most likely
be via a pointer.
This commit moves orig_rq to the end of the structure, so that the offsets
of the existing elements remain the same as before. It also wraps orig_rq
in "#ifndef __GENKSYMS__" to hide it from the kABI checker.
Inside timeout handler, blk_mq_tag_to_rq() is called
to retrieve the request from one tag. This is obviously
wrong because the request can be freed at any time and some
fields of the request can't be trusted, so a kernel oops
might be triggered[1].
Currently wrt. blk_mq_tag_to_rq(), the only special case is
that the flush request can share same tag with the request
cloned from, and the two requests can't be active at the same
time, so this patch fixes the above issue by updating tags->rqs[tag]
with the active request(either flush rq or the request cloned
from) of the tag.
Also blk_mq_tag_to_rq() gets much simplified with this patch.
Since blk_mq_tag_to_rq() is mainly for drivers and the caller must
make sure the request can't be freed, in bt_for_each() this
helper is replaced with tags->rqs[tag].
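With tags->rqs[tag] always holding the active request for a tag, the
helper reduces to a plain lookup (sketch):

	struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags,
					 unsigned int tag)
	{
		/* valid only while the caller guarantees the request
		 * cannot be freed */
		return tags->rqs[tag];
	}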
Commit e92f419 ("blk-mq: Shared tag enhancements") added the element
"cpumask" to struct blk_mq_tags. This broke kABI.
Currently, this structure is only used by the core blk-mq code.
I don't expect that third-party modules will use it at all.
Even if they do, though, the kernel provides an API that
encapsulates creation/destruction/access for these structures,
and modules that use that API will be unaffected by the change.
So the workaround for the kABI breakage is simply to hide the new
element from the kABI checker by wrapping it in #ifndef __GENKSYMS__.
Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit f26cdc8536ad50fb802a0445f836b4f94ca09ae7) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This oops happens with the namespace_sem held and can be triggered by
non-root users. An all around not pleasant experience.
To avoid this scenario when finding the appropriate source mount to
copy stop the walk up the mnt_master chain when the first source mount
is encountered.
Further rewrite the walk up the last_source mnt_master chain so that
it is clear what is going on.
The reason why the first source mount is special is that its
mnt_parent is not a mount in the dest_mnt propagation tree, and as
such termination conditions based upon the dest_mnt mount propagation
tree do not make sense.
To avoid other kinds of confusion last_dest is not changed when
computing last_source. last_dest is only used once in propagate_one
and that is above the point of the code being modified, so changing
the global variable is meaningless and confusing.
Maxim Patlasov [Tue, 16 Feb 2016 19:45:33 +0000 (11:45 -0800)]
fs/pnode.c: treat zero mnt_group_id-s as unequal
propagate_one(m) calculates "type" argument for copy_tree() like this:
> if (m->mnt_group_id == last_dest->mnt_group_id) {
> type = CL_MAKE_SHARED;
> } else {
> type = CL_SLAVE;
> if (IS_MNT_SHARED(m))
> type |= CL_MAKE_SHARED;
> }
The "type" argument then governs clone_mnt() behavior with respect to flags
and mnt_master of new mount. When we iterate through a slave group, it is
possible that both current "m" and "last_dest" are not shared (although,
both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
above erroneously makes new mount shared and sets its mnt_master to
last_source->mnt_master. The patch fixes the problem by handling zero
mnt_group_id-s as though they are unequal.
A similar problem exists in the implementation of the "else" clause above,
when we have to ascend upward in the master/slave tree by following
mnt_master the proper number of times. The last step is governed by the
"n->mnt_group_id != last_dest->mnt_group_id" condition, which wrongly
indicates a peer-group match when both ids are zero. The patch fixes this
case in the same way as the former one.
[AV: don't open-code an obvious helper...]
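The helper compares group ids while treating a zero id as never equal;
a sketch:
	static inline bool peers(struct mount *m1, struct mount *m2)
	{
		return m1->mnt_group_id == m2->mnt_group_id && m1->mnt_group_id;
	}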
Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 7ae8fd0351f912b075149a1e03a017be8b903b9a)
VSOCK: Only check error on skb_recv_datagram when skb is NULL
If skb_recv_datagram returns an skb, we should ignore the err
value returned. Otherwise, datagram receives will return EAGAIN
when they have to wait for a datagram.
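A sketch of the intended pattern in the datagram receive path:
	skb = skb_recv_datagram(&vsk->sk, flags, noblock, &err);
	if (!skb)
		return err;	/* err is only meaningful with no skb */
	/* an skb was returned; ignore whatever is left in err */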
Acked-by: Adit Ranadive <aditr@vmware.com> Signed-off-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9c995cc9a206a008699da82f6cd01e9b2615649a) Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 79acddbe36a34e54b558ca7f30938e75e005863b)
VSOCK: Detach QP check should filter out non matching QPs.
The check in vmci_transport_peer_detach_cb should only allow a
detach when the qp handle of the transport matches the one in
the detach message.
Testing: Before this change, a detach from a peer on a different
socket would cause an active stream socket to register a detach.
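A sketch of the filter in vmci_transport_peer_detach_cb(), using the
standard vmci handle helpers:
	/* ignore detach events for queue pairs other than this
	 * transport's own */
	if (vmci_handle_is_invalid(e_payload->handle) ||
	    !vmci_handle_is_equal(trans->qp_handle, e_payload->handle))
		return;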
Reviewed-by: George Zhang <georgezhang@vmware.com> Signed-off-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8ab18d71de8b07d2c4d6f984b718418c09ea45c5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 3cab04cb1262c0f6119e95a998d8f0ce3b0dd09f)
Intel's MCA implementation broadcasts MCEs to all CPUs on the
node. This poses a problem for offlined CPUs which cannot
participate in the rendezvous process:
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Kernel Offset: disabled
Rebooting in 100 seconds..
More specifically, Linux does a soft offline of a CPU when
writing a 0 to /sys/devices/system/cpu/cpuX/online, which
doesn't prevent the #MC exception from being broadcasted to that
CPU.
Ensure that offline CPUs don't participate in the MCE rendezvous
and clear the RIP valid status bit so that a second MCE won't
cause a shutdown.
Without the patch, mce_start() will increment mce_callin and
wait for all CPUs. Offlined CPUs should avoid participating in
the rendezvous process altogether.
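A sketch of the early bail-out at the top of the #MC handler, per the
description above (helper names as used in the arch/x86 mce code):
	/* do_machine_check(): offline CPUs must not join the rendezvous */
	if (cpu_is_offline(smp_processor_id())) {
		u64 mcgstatus = mce_rdmsrl(MSR_IA32_MCG_STATUS);

		/* clear MCG_STATUS (RIPV included) so a second #MC
		 * can't force a shutdown */
		if (mcgstatus & MCG_STATUS_RIPV) {
			mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
			return;
		}
	}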
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Richard Alpe [Mon, 16 May 2016 09:14:54 +0000 (11:14 +0200)]
tipc: check nl sock before parsing nested attributes
Make sure the socket for which the user is listing publication exists
before parsing the socket netlink attributes.
Prior to this patch a call without any socket caused a NULL pointer
dereference in tipc_nl_publ_dump().
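A sketch of the added guard in tipc_nl_publ_dump(), before the nested
attributes are parsed:
	if (!attrs[TIPC_NLA_SOCK])
		return -EINVAL;	/* no socket attribute: nothing to parse */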
Tested-and-reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 45e093ae2830cd1264677d47ff9a95a71f5d9f9c)
Payloads of NM entries are not supposed to contain NUL. When we run
into such, only the part prior to the first NUL goes into the
concatenation (i.e. the directory entry name being encoded by a bunch
of NM entries). We do stop when the amount collected so far + the
claimed amount in the current NM entry exceed 254. So far, so good,
but what we return as the total length is the sum of *claimed*
sizes, not the actual amount collected. And that can grow pretty
large - not unlimited, since you'd need to put CE entries in
between to be able to get more than the maximum that could be
contained in one isofs directory entry / continuation chunk and
we stop once we've encountered 32 CEs, but you can get about 8Kb
easily. And that's what will be passed to readdir callback as the
name length. 8Kb __copy_to_user() from a buffer allocated by
__get_free_page()
Cc: stable@vger.kernel.org # 0.98pl6+ (yes, really) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 99d825822eade8d827a1817357cbf3f889a552d6)
If a directory inode is orphanized, because some inode previously
processed has a new name that collides with the old name of the current
inode, we need to check if it needs its rename operation delayed too,
as its ancestor-descendent relationship with some other inode might
have been reversed between the parent and send snapshots and therefore
its rename operation needs to happen after that other inode is renamed.
For example, for the following reproducer where this is needed (provided
by Robbie Ko):
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
While processing inode 262 we orphanize inode 264 and later attempt
to rename inode 264 to its new name/location, which resulted in building
an incorrect destination path string for the rename operation with the
value "data/n4/t2/t7/t4/t5/t6/n2/t7/t3/t7". This rename operation must
have been done only after inode 267 is processed and renamed, as the
ancestor-descendent relationship between inodes 264 and 267 was reversed
between both snapshots, because otherwise it results in an infinite loop
when building the path string for inode 264 when we are processing an
inode with a number larger than 264. That loop is the following:
start inode 264, send progress of 265 for example
parent of 264 -> 267
parent of 267 -> 262
parent of 262 -> 259
parent of 259 -> 261
parent of 261 -> 263
parent of 263 -> 266
parent of 266 -> 264
|--> back to first iteration while current path string length
is <= PATH_MAX, and fail with -ENOMEM otherwise
So fix this by making the check if we need to delay a directory rename
regardless of the current inode having been orphanized or not.
A test case for fstests follows soon.
Thanks to Robbie Ko for providing a reproducer for this problem.
Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 8b191a684968e24b34c9894024b37532c68e6ae8) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Chuck Lever [Mon, 2 May 2016 18:40:31 +0000 (14:40 -0400)]
sunrpc: Update RPCBIND_MAXNETIDLEN
Commit 176e21ee2ec8 ("SUNRPC: Support for RPC over AF_LOCAL
transports") added a 5-character netid, but did not bump
RPCBIND_MAXNETIDLEN from 4 to 5.
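The fix is a one-character bump of the constant; roughly:
	/* room for "local", the 5-character AF_LOCAL netid */
	#define RPCBIND_MAXNETIDLEN	(5u)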
Fixes: 176e21ee2ec8 ("SUNRPC: Support for RPC over AF_LOCAL ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
(cherry picked from commit 4b9c7f9db9a003f5c342184dc4401c1b7f2efb39) Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Orabug: 22695683
Shirley Ma [Wed, 4 May 2016 14:52:47 +0000 (10:52 -0400)]
svcrdma: Support IPv6 with NFS/RDMA
Allow both IPv4 and IPv6 to bind same port at the same time,
restricts use of the IPv6 socket to IPv6 communication.
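A sketch of the relevant listener setup (error handling elided; listen_id
being the svcrdma rdma_cm listener):
	/* restrict the IPv6 listener to IPv6 traffic so an IPv4
	 * listener can bind the same port */
	ret = rdma_set_afonly(listen_id, 1);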
Changes from v1:
- Check rdma_set_afonly return value (suggested by Leon Romanovsky)
Changes from v2:
- Acked-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Shirley Ma <shirley.ma@oracle.com> Acked-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit 696190eaf168e83fcbf0c9cf087995cdd15807e5) Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Orabug: 22695683
Shirley Ma [Mon, 2 May 2016 18:40:23 +0000 (14:40 -0400)]
xprtrdma: Add rdma6 option to support NFS/RDMA IPv6
RFC 5666: The "rdma" netid is to be used when IPv4 addressing
is employed by the underlying transport, and "rdma6" for IPv6
addressing.
Add mount -o proto=rdma6 option to support NFS/RDMA IPv6 addressing.
Changes from v2:
- Integrated comments from Chuck Lever, Anna Schumaker, Trond Myklebust
- Add a little more to the patch description to describe NFS/RDMA
IPv6, as suggested by Chuck Lever and Anna Schumaker
- Removed duplicated rdma6 define
- Remove Opt_xprt_rdma mountfamily since it doesn't support
Signed-off-by: Shirley Ma <shirley.ma@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
(cherry picked from commit 181342c5ebe8cc31f75b80ace18ae8a89a0c145a) Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Orabug: 22695683
Keith Busch [Fri, 18 Dec 2015 00:08:14 +0000 (17:08 -0700)]
blk-mq: dynamic h/w context count
The hardware's provided queue count may change at runtime with resource
provisioning. This patch allows a block driver to alter the number of
h/w queues available when its resource count changes.
The main part is a new blk-mq API to request a new number of h/w queues
for a given live tag set. The new API freezes all queues using that set,
then adjusts the allocated count prior to remapping these to CPUs.
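The new entry point is roughly:
	/* freeze all queues using the set, adjust set->nr_hw_queues,
	 * then remap hctxs to CPUs and unfreeze */
	void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
					int nr_hw_queues);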
The bulk of the rest just shifts where h/w contexts and all their
artifacts are allocated and freed.
The maximum number of h/w contexts is capped at the number of possible CPUs
since there is no use for more than that. As such, all pre-allocated
pointer memory needs to account for the maximum possible rather than
the initial number of queues.
A side effect of this is that the blk-mq will proceed successfully as
long as it can allocate at least one h/w context. Previously it would
fail request queue initialization if less than the requested number
was allocated.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 23340426
(cherry picked from commit 868f2f0b72068a097508b6e8870a8950fd8eb7ef) Signed-off-by: Bob Liu <bob.liu@oracle.com>
Akinobu Mita [Sat, 26 Sep 2015 17:09:23 +0000 (02:09 +0900)]
blk-mq: avoid inserting requests before establishing new mapping
Notifier callbacks for the CPU_ONLINE action can run on a CPU other
than the one which was just onlined. So it is possible for a
process running on the just-onlined CPU to insert a request and run
the hw queue before the new mapping is established by
blk_mq_queue_reinit_notify().
This can cause a problem when the CPU has just been onlined first time
since the request queue was initialized. At this time ctx->index_hw
for the CPU, which is the index in hctx->ctxs[] for this ctx, is still
zero before blk_mq_queue_reinit_notify() is called by notifier
callbacks for CPU_ONLINE action.
For example, there is a single hw queue (hctx) and two CPU queues
(ctx0 for CPU0, and ctx1 for CPU1). Now CPU1 is just onlined and
a request is inserted into ctx1->rq_list and set bit0 in pending
bitmap as ctx1->index_hw is still zero.
And then while running the hw queue, flush_busy_ctxs() finds bit0 is set
in the pending bitmap and tries to retrieve requests in
hctx->ctxs[0]->rq_list. But hctx->ctxs[0] is a pointer to ctx0, so the
request in ctx1->rq_list is ignored.
Fix it by ensuring that new mapping is established before onlined cpu
starts running.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 23340426
(cherry picked from commit 5778322e67ed34dc9f391a4a5cbcbb856071ceba) Signed-off-by: Bob Liu <bob.liu@oracle.com>
Willy Tarreau [Mon, 18 Jan 2016 15:36:09 +0000 (16:36 +0100)]
pipe: limit the per-user amount of pages allocated in pipes
Add fix for Kabi breakage for pipe: limit the per-user amount of pages
allocated in pipes
On not-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.
This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.
The limits are controlled by two new sysctls: pipe-user-pages-soft and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).
Needs pre-req commit 0c8ecee92c2241adf9ca910aa42ef0112913b817 ("add
user_struct element unix_inflight, pre-req for 'pipe: limit the per-user
amount of pages allocated in pipes'") due to a merge conflict with master
commit c057264c4ba7ddf9d775c013fcb1cecb910bf2b9 ("fix kABI breakage from
'unix: properly account for FDs passed over unix sock'"). Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Chuck Anderson [Fri, 10 Jun 2016 20:53:15 +0000 (13:53 -0700)]
add user_struct element unix_inflight pre-req for "pipe: limit the per-user amount of pages allocated in pipes"
The commit "pipe: limit the per-user amount of pages allocated in pipes"
caused a merge conflict with uek/uek-4.1 due to commit
c057264c4ba7ddf9d775c013fcb1cecb910bf2b9 ("fix kABI breakage from 'unix:
properly account for FDs passed over unix sock'").
Add include/linux/sched.h user_struct change from c057264c so that the
code context matches uek/uek-4.1
Orabug: 22901731 Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Xunlei Pang [Wed, 2 Dec 2015 11:52:59 +0000 (19:52 +0800)]
sched/core: Clear the root_domain cpumasks in init_rootdomain()
root_domain::rto_mask allocated through alloc_cpumask_var()
contains garbage data; this may cause problems. For instance,
when doing pull_rt_task(), it may do useless iterations if
rto_mask retains some extra garbage bits. Worse still, this
violates the isolated-domain rule for clustered scheduling
using cpuset, because tasks (with all CPUs allowed) belonging
to one root domain can be pulled away into another
root domain.
The patch cleans the garbage by using zalloc_cpumask_var()
instead of alloc_cpumask_var() for root_domain::rto_mask
allocation, thereby addressing the issues.
Do the same thing for the root_domain's other cpumask members:
dlo_mask, span, and online.
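The change itself is mechanical; for each of the masks, roughly:
	/* zalloc_cpumask_var() hands back a zeroed mask, unlike
	 * alloc_cpumask_var() whose contents are uninitialized */
	if (!zalloc_cpumask_var(&rd->rto_mask, GFP_KERNEL))
		goto free_dlo_mask;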
Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1449057179-29321-1-git-send-email-xlpang@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 23307036
(cherry picked from commit 8295c69925ad53ec32ca54ac9fc194ff21bc40e2) Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Two new messages are added to support negotiating the hb timeout. Stop
nodes speaking an old protocol version from mounting, as they would cause
the negotiation to fail.
Hannes Reinecke [Thu, 26 Nov 2015 07:46:57 +0000 (08:46 +0100)]
block: Always check queue limits for cloned requests
When a cloned request is retried on other queues it always needs
to be checked against the queue limits of that queue.
Otherwise the calculations for nr_phys_segments might be wrong,
leading to a crash in scsi_init_sgtable().
To clarify this the patch renames blk_rq_check_limits()
to blk_cloned_rq_check_limits() and removes the symbol
export, as the new function should only be used for
cloned requests and never exported.
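A sketch of the call site after the rename (the helper itself is now
static to the block core):
	int blk_insert_cloned_request(struct request_queue *q, struct request *rq)
	{
		if (blk_cloned_rq_check_limits(q, rq))
			return -EIO;
		...
	}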
Cc: Mike Snitzer <snitzer@redhat.com> Cc: Ewan Milne <emilne@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Fixes: e2a60da74 ("block: Clean up special command handling logic") Cc: stable@vger.kernel.org # 3.7+ Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit bf4e6b4e757488dee1b6a581f49c7ac34cd217f8)
Orabug: 23140620 Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>
We should be doing this, it's weird we hadn't been doing this.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 0d2b2372e097cd3b4150d3ec91e79ac3c5cc750e) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
If we remove a hard link from an inode, the inode gets evicted, then
we fsync the inode and then power fail/crash, when the log tree is
replayed, the parent directory inode still has entries pointing to
the name that no longer exists, while our inode no longer has the
BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
leaving the filesystem in an inconsistent state. The stale directory
entries can not be deleted (an attempt to delete them causes -ESTALE
errors), which makes it impossible to delete the parent directory.
This happens because we track the id of the transaction where the last
unlink operation for the inode happened (last_unlink_trans) in an
in-memory only field of the inode, that is, a value that is never
persisted in the inode item stored on the fs/subvol btree. So if an
inode is evicted and loaded again, the value for last_unlink_trans is
set to 0, which prevents the fsync from logging the parent directory
at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
to the id of the transaction that last modified the inode when we
load the inode. This is a pessimistic approach, but it always ensures
correctness, with the trade-off of occasional full transaction commits
(when an fsync is done against the inode in the same transaction where
it was evicted and reloaded, and our inode is a directory) and of often
logging its parent unnecessarily when our inode is not a directory.
The following test case for fstests triggers the problem:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
# Create our test file with 2 hard links.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/testdir/foo
ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
# Make sure everything done so far is durably persisted.
sync
# Now remove one of the links, trigger inode eviction and then fsync
# our inode.
unlink $SCRATCH_MNT/testdir/bar
echo 2 > /proc/sys/vm/drop_caches
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo
# Silently drop all writes on our scratch device to simulate a power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount the fs to trigger log/journal replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now verify our directory entries.
echo "Entries in testdir:"
ls -1 $SCRATCH_MNT/testdir
# If we remove our inode, its parent should become empty and therefore we should
# be able to remove the parent.
rm -f $SCRATCH_MNT/testdir/*
rmdir $SCRATCH_MNT/testdir
_unmount_flakey
# The fstests framework will call fsck against our filesystem which will verify
# that all metadata is in a consistent state.
status=0
exit
The test failed on btrfs with:
generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
--- tests/generic/098.out 2015-07-23 18:01:12.616175932 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad 2015-07-23 18:04:58.924138308 +0100
@@ -1,3 +1,6 @@
QA output created by 098
Entries in testdir:
+bar
foo
+rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
...
(Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad' to see the entire diff)
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)
$ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
(...)
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
Checking filesystem on /dev/sdc
(...)
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit bde6c242027b0f1d697d5333950b3a05761d40e4) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Andrey Ryabinin [Wed, 7 Oct 2015 11:39:55 +0000 (14:39 +0300)]
lockd: get rid of reference-counted NSM RPC clients
Currently we have a reference-counted per-net NSM RPC client
which is created on the first monitor request and destroyed
after the last unmonitor request. It's needed because the
RPC client needs to know 'utsname()->nodename', but utsname()
might be NULL when nsm_unmonitor() is called.
So instead of holding the RPC client we could just save the nodename
in struct nlm_host and pass it to rpc_create().
Thus there is no need to keep the RPC client until the last
unmonitor request; we can create separate RPC clients
for each monitor/unmonitor request.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit 0d0f4aab4e4d290138a4ae7f2ef8469e48c9a669)
Commit cb7323fffa85 ("lockd: create and use per-net NSM
RPC clients on MON/UNMON requests") introduced per-net
NSM RPC clients. Unfortunately this doesn't make any sense
without per-net nsm_handle.
E.g. the following scenario could happen
Two hosts (X and Y) in different namespaces (A and B) share
the same nsm struct.
1. nsm_monitor(host_X) called => NSM rpc client created,
nsm->sm_monitored bit set.
2. nsm_monitor(host_Y) called => nsm->sm_monitored already set,
we just exit. Thus in namespace B ln->nsm_clnt == NULL.
3. host X destroyed => nsm->sm_count decremented to 1
4. host Y destroyed => nsm_unmonitor() => nsm_mon_unmon() => NULL-ptr
dereference of *ln->nsm_clnt
So this could be fixed by making the nsm_handles list per-net
instead of global. Thus different net namespaces will not be able
to share the same nsm_handle.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit 0ad95472bf169a3501991f8f33f5147f792a8116)
Keith Busch [Mon, 24 Aug 2015 13:48:16 +0000 (08:48 -0500)]
PCI: Set MPS to match upstream bridge
Firmware typically configures the PCIe fabric with a consistent Max Payload
Size setting based on the devices present at boot. A hot-added device
typically has the power-on default MPS setting (128 bytes), which may not
match the fabric.
The previous Linux default, in the absence of any "pci=pcie_bus_*" options,
was PCIE_BUS_TUNE_OFF, in which we never touch MPS, even for hot-added
devices.
Add a new default setting, PCIE_BUS_DEFAULT, in which we make sure every
device's MPS setting matches the upstream bridge. This makes it more
likely that a hot-added device will work in a system with optimized MPS
configuration.
Note that if we hot-add a device that only supports 128-byte MPS, it still
likely won't work because we don't reconfigure the rest of the fabric.
Booting with "pci=pcie_bus_peer2peer" is a workaround for this because it
sets MPS to 128 for everything.
[bhelgaas: changelog, new default, rework for pci_configure_device() path] Tested-by: Keith Busch <keith.busch@intel.com> Tested-by: Jordan Hargrave <jharg93@gmail.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Yinghai Lu <yinghai@kernel.org>
(cherry picked from commit 27d868b5e6cfaee4fec66b388e4085ff94050fa7)
Bjorn Helgaas [Thu, 20 Aug 2015 21:08:27 +0000 (16:08 -0500)]
PCI: Move MPS configuration check to pci_configure_device()
Previously we checked for invalid MPS settings, i.e., a device with MPS
different than its upstream bridge, in pcie_bus_detect_mps(). We only did
this if the arch or hotplug driver called pcie_bus_configure_settings(),
and then only if PCIe bus tuning was disabled (PCIE_BUS_TUNE_OFF).
Move the MPS checking code to pci_configure_device(), so we do it in the
pci_device_add() path for every device.
In the ASN.1 decoder, when the length field of an ASN.1 value is extracted,
it isn't validated against the remaining amount of data before being added
to the cursor. With a sufficiently large size indicated, the check:
datalen - dp < 2
may then fail due to integer overflow.
Fix this by checking the length indicated against the amount of remaining
data in both places a definite length is determined.
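A sketch of the hardened check, in terms of the decoder's cursor dp and
total datalen:
	/* reject a claimed length that runs past the end of the buffer */
	if (len > datalen - dp)
		goto data_overrun_error;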
Whilst we're at it, make the following changes:
(1) Check the maximum size of extended length does not exceed the capacity
of the variable it's being stored in (len) rather than the type that
variable is assumed to be (size_t).
(2) Compare the EOC tag to the symbolic constant ASN1_EOC rather than the
integer 0.
(3) To reduce confusion, move the initialisation of len outside of:
for (len = 0; n > 0; n--) {
since it doesn't have anything to do with the loop counter n.
Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Acked-by: Peter Jones <pjones@redhat.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Peter Hurley [Mon, 11 Jan 2016 06:40:55 +0000 (22:40 -0800)]
tty: Fix unsafe ldisc reference via ioctl(TIOCGETD)
ioctl(TIOCGETD) retrieves the line discipline id directly from the
ldisc because the line discipline id (c_line) in termios is untrustworthy;
userspace may have set termios via ioctl(TCSETS*) without actually
changing the line discipline via ioctl(TIOCSETD).
However, directly accessing the current ldisc via tty->ldisc is
unsafe; the ldisc ptr dereferenced may be stale if the line discipline
is changing via ioctl(TIOCSETD) or hangup.
Wait for the line discipline reference (just like read() or write())
to retrieve the "current" line discipline id.
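A sketch of the fixed handler, waiting on an ldisc reference just like the
read/write paths do:
	static int tiocgetd(struct tty_struct *tty, int __user *p)
	{
		struct tty_ldisc *ld;
		int ret;

		ld = tty_ldisc_ref_wait(tty);	/* blocks out TIOCSETD/hangup */
		ret = put_user(ld->ops->num, p);
		tty_ldisc_deref(ld);
		return ret;
	}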
Cc: <stable@vger.kernel.org> Signed-off-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5c17c861a357e9458001f021a7afa7aab9937439)
Alan Stern [Wed, 16 Dec 2015 18:32:38 +0000 (13:32 -0500)]
USB: fix invalid memory access in hub_activate()
Commit 8520f38099cc ("USB: change hub initialization sleeps to
delayed_work") changed the hub_activate() routine to make part of it
run in a workqueue. However, the commit failed to take a reference to
the usb_hub structure or to lock the hub interface while doing so. As
a result, if a hub is plugged in and quickly unplugged before the work
routine can run, the routine will try to access memory that has been
deallocated. Or, if the hub is unplugged while the routine is
running, the memory may be deallocated while it is in active use.
This patch fixes the problem by taking a reference to the usb_hub at
the start of hub_activate() and releasing it at the end (when the work
is finished), and by locking the hub interface while the work routine
is running. It also adds a check at the start of the routine to see
if the hub has already been disconnected, in which case nothing should be
done.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Alexandru Cornea <alexandru.cornea@intel.com> Tested-by: Alexandru Cornea <alexandru.cornea@intel.com> Fixes: 8520f38099cc ("USB: change hub initialization sleeps to delayed_work") CC: <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e50293ef9775c5f1cf3fcc093037dd6a8c5684ea)
Commit 8b13eddfdf04cbfa561725cfc42d6868fe896f56 ("netfilter: refactor NAT
redirect IPv4 to use it from nf_tables") has introduced a trivial logic
change which can result in the following crash.
unsigned int
nf_nat_redirect_ipv4(struct sk_buff *skb,
...
{
	...
	rcu_read_lock();
	indev = __in_dev_get_rcu(skb->dev);
	if (indev != NULL) {
		ifa = indev->ifa_list;
		newdst = ifa->ifa_local; <---
	}
	rcu_read_unlock();
	...
}
Before the commit, 'ifa' had always been checked before access. After the
commit, however, it could be accessed even if it is NULL. Interestingly,
this was once fixed in 2003.
In addition to the original one, we have seen the crash when packets that
need to be redirected somehow arrive on an interface which hasn't been
yet fully configured.
This change just reverts the logic to the old behavior to avoid the crash.
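That is, restore the NULL check on ifa before dereferencing it; a sketch
of the restored logic:
	rcu_read_lock();
	indev = __in_dev_get_rcu(skb->dev);
	if (indev && indev->ifa_list) {
		ifa = indev->ifa_list;
		newdst = ifa->ifa_local;
	}
	rcu_read_unlock();

	if (!newdst)
		return NF_DROP;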
Fixes: 8b13eddfdf04 ("netfilter: refactor NAT redirect IPv4 to use it from nf_tables") Signed-off-by: Munehisa Kamata <kamatam@amazon.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 94f9cd81436c85d8c3a318ba92e236ede73752fc)
Andy Lutomirski [Wed, 6 Jan 2016 20:21:01 +0000 (12:21 -0800)]
x86/mm: Add barriers and document switch_mm()-vs-flush synchronization
When switch_mm() activates a new PGD, it also sets a bit that
tells other CPUs that the PGD is in use so that TLB flush IPIs
will be sent. In order for that to work correctly, the bit
needs to be visible prior to loading the PGD and therefore
starting to fill the local TLB.
Document all the barriers that make this work correctly and add
a couple that were missing.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 71b3c126e61177eb693423f2e18a1914205b165e)
A case can occur when sctp_accept() is called by the user during
a heartbeat timeout event after the 4-way handshake. Since
sctp_assoc_migrate() changes both assoc->base.sk and assoc->ep, the
bh_sock_lock in sctp_generate_heartbeat_event() will be taken with
the listening socket but released with the new association socket.
The result is a deadlock on any future attempts to take the listening
socket lock.
Note that this race can occur with other SCTP timeouts that take
the bh_lock_sock() in the event sctp_accept() is called.
=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
CslRx/12087 is trying to release lock (slock-AF_INET) at:
[<ffffffffa01bcae0>] sctp_generate_timeout_event+0x40/0xe0 [sctp]
but there are no more locks to release!
other info that might help us debug this:
2 locks held by CslRx/12087:
#0: (&asoc->timers[i]){+.-...}, at: [<ffffffff8108ce1f>] run_timer_softirq+0x16f/0x3e0
#1: (slock-AF_INET){+.-...}, at: [<ffffffffa01bcac3>] sctp_generate_timeout_event+0x23/0xe0 [sctp]
Ensure the socket taken is also the same one that is released by
saving a copy of the socket before entering the timeout event
critical section.
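A sketch of the pattern in the timeout handlers:
	struct sock *sk = asoc->base.sk;	/* snapshot before any migration */

	bh_lock_sock(sk);
	...
	bh_unlock_sock(sk);	/* guaranteed to be the lock we actually took */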
Signed-off-by: Karl Heiss <kheiss@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 635682a14427d241bab7bbdeebb48a7d7b91638e) Signed-off-by: Brian Maly <brian.maly@oracle.com> Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
net/sctp/sm_sideeffect.c
Alexey Dobriyan [Thu, 25 Jun 2015 22:00:54 +0000 (15:00 -0700)]
proc: fix PAGE_SIZE limit of /proc/$PID/cmdline
/proc/$PID/cmdline truncates output at PAGE_SIZE. It is easy to see with
$ cat /proc/self/cmdline $(seq 1037) 2>/dev/null
However, command line size was never limited to PAGE_SIZE but to 128 KB
and relatively recently limitation was removed altogether.
People noticed and ask questions:
http://stackoverflow.com/questions/199130/how-do-i-increase-the-proc-pid-cmdline-4096-byte-limit
seq file interface is not OK, because it kmalloc's for whole output and
open + read(, 1) + sleep will pin arbitrary amounts of kernel memory. To
avoid that, a limit must be imposed, which is incompatible with arbitrarily
sized command lines.
I apologize for the hairy code, but it is a direct consequence of the
command line layout in memory and of hacks to support things like "init [3]".
The loops are "unrolled" otherwise it is either macros which hide control
flow or functions with 7-8 arguments with equal line count.
There should be real setproctitle(2) or something.
[akpm@linux-foundation.org: fix a billion min() warnings] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Tested-by: Jarod Wilson <jarod@redhat.com> Acked-by: Jarod Wilson <jarod@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jan Stancek <jstancek@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 23093364
Mainline commit c2c0bb44620dece7ec97e28167e32c343da22867 Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
The problem comes with the fact that many such jobs (for the same device)
are being scheduled simultaneously. While scsi_remove_device() uses
shost->scan_mutex and scsi_device_lookup() will fail for a device in
SDEV_DEL state there is no protection against someone who did
scsi_device_lookup() before we actually entered __scsi_remove_device(). So
the whole scenario looks like this: two callers make simultaneous (or
preemption happens) calls to scsi_device_lookup() and these calls succeed
for both of them; after that they both try doing scsi_remove_device().
shost->scan_mutex only serializes their calls to __scsi_remove_device()
and we end up doing the cleanup path twice.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit be821fd8e62765de43cc4f0e2db363d0e30a7e9b)
Orabug: 23021563 Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>
David S. Miller [Mon, 14 Mar 2016 03:28:00 +0000 (23:28 -0400)]
ipv4: Don't do expensive useless work during inetdev destroy.
When an inetdev is destroyed, every address assigned to the interface
is removed. And in this scenario we do two pointless things which can
be very expensive if the number of assigned addresses is large:
1) Address promotion. We are deleting all addresses, so there is no
point in doing this.
2) A full nf conntrack table purge for every address. We only need to
do this once, as is already caught by the existing
masq_dev_notifier so masq_inet_event() can skip this.
Reported-by: Solar Designer <solar@openwall.com> Signed-off-by: David S. Miller <davem@davemloft.net> Tested-by: Cyrill Gorcunov <gorcunov@openvz.org>
(cherry picked from commit fbd40ea0180a2d328c5adc61414dc8bab9335ce2)
Filipe Manana [Sat, 20 Jun 2015 17:20:09 +0000 (18:20 +0100)]
Btrfs: fix shrinking truncate when the no_holes feature is enabled
If the no_holes feature is enabled, we attempt to shrink a file to a size
that ends up in the middle of a hole and we don't have any file extent
items in the fs/subvol tree that go beyond the new file size (or any
ordered extents that will insert such file extent items), we end up not
updating the inode's disk_i_size, we only update the inode's i_size.
This means that after unmounting and mounting the filesystem, or after
the inode is evicted and reloaded, its i_size ends up being incorrect
(an inode's i_size is set to the disk_i_size field when an inode is
loaded). This happens when btrfs_truncate_inode_items() doesn't find
any file extent items to drop - in this case it never makes a call to
btrfs_ordered_update_i_size() in order to update the inode's disk_i_size.
Example reproducer:
$ mkfs.btrfs -O no-holes -f /dev/sdd
$ mount /dev/sdd /mnt
# Create our test file with some data and durably persist it.
$ xfs_io -f -c "pwrite -S 0xaa 0 128K" /mnt/foo
$ sync
# Append some data to the file, increasing its size, and leave a hole
# between the old size and the start offset of the following write. So
# our file gets a hole in the range [128Kb, 256Kb[.
$ xfs_io -c "pwrite -S 0xbb 256K 32K" /mnt/foo
# Now truncate the file to 160K, a size that ends up in the middle of
# the hole we created above.
$ xfs_io -c "truncate 160K" /mnt/foo
# We expect to see our file with a size of 160Kb, with the first 128Kb
# of data all having the value 0xaa and the remaining 32Kb of data all
# having the value 0x00.
$ od -t x1 /mnt/foo
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0400000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0500000
# Now cleanly unmount and mount again the filesystem.
$ umount /mnt
$ mount /dev/sdd /mnt
# We expect to get the same result as before, a file with a size of
# 160Kb, with the first 128Kb of data all having the value 0xaa and the
# remaining 32Kb of data all having the value 0x00.
$ od -t x1 /mnt/foo
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0400000
In the example above the file size/data do not match what they were before
the remount.
Fix this by always calling btrfs_ordered_update_i_size() with a size
matching the size the file was truncated to if btrfs_truncate_inode_items()
is not called for a log tree and no file extent items were dropped. This
ensures the same behaviour as when the no_holes feature is not enabled.
Even though we delay the rename of directories when they become
descendents of other directories that were also renamed in the send
root to prevent infinite path build loops, we were doing it in cases
where this was not needed and was actually harmful resulting in
infinite path build loops as we ended up with a circular dependency
of delayed directory renames.
Consider the following reproducer:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt2
This reproducer resulted in an infinite path build loop when building the
path for inode 266 because the following circular dependency of delayed
directory renames was created:
Where the notation "X <- Y" means the rename of inode X is delayed by the
rename of inode Y (X will be renamed after Y is renamed). This resulted
in an infinite path build loop of inode 266 because that inode has inode
261 as an ancestor in the send root and inode 261 is in the circular
dependency of delayed renames listed above.
Fix this by not delaying the rename of a directory inode if an ancestor of
the inode in the send root, which has a delayed rename operation, is not
also a descendent of the inode in the parent root.
Thanks to Robbie Ko for sending the reproducer example.
A test case for xfstests follows soon.
Reported-by: Robbie Ko <robbieko@synology.com> Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from mainline commit 80aa6027561eef12b49031d46fd6724daf1e7fb6) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
We have another case where after an fsync log replay we get an inode with
a wrong link count (smaller than it should be) and a number of directory
entries greater than its link count. This happens when we add a new
hard link to our inode A and then fsync some other inode B that has
the side effect of logging the parent directory inode too. In this case
at log replay time we add the new hard link to our inode (the item with
key BTRFS_INODE_REF_KEY) when processing the parent directory but we
never adjust the link count of our inode A. As a result we get stale dir
entries for our inode A that can never be deleted and therefore it makes
it impossible to remove the parent directory (as its i_size can never
decrease back to 0).
A simple reproducer for fstests that triggers this issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
_cleanup_flakey
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
. ./common/dmflakey
# real QA test starts here
_need_to_be_root
_supported_fs generic
_supported_os Linux
_require_scratch
_require_dm_flakey
_require_metadata_journaling $SCRATCH_DEV
# Create our test directory and files.
mkdir $SCRATCH_MNT/testdir
touch $SCRATCH_MNT/testdir/foo
touch $SCRATCH_MNT/testdir/bar
# Make sure everything done so far is durably persisted.
sync
# Create one hard link for file foo and another one for file bar. After
# that fsync only the file bar.
ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/bar_link
ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/foo_link
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar
# Silently drop all writes on scratch device to simulate power failure.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
# Allow writes again and mount the fs to trigger log/journal replay.
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Now verify both our files have a link count of 2.
echo "Link count for file foo: $(stat --format=%h $SCRATCH_MNT/testdir/foo)"
echo "Link count for file bar: $(stat --format=%h $SCRATCH_MNT/testdir/bar)"
# We should be able to remove all the links of our files in testdir, and
# after that the parent directory should become empty and therefore
# possible to remove it.
rm -f $SCRATCH_MNT/testdir/*
rmdir $SCRATCH_MNT/testdir
_unmount_flakey
# The fstests framework will call fsck against our filesystem which will verify
# that all metadata is in a consistent state.
status=0
exit
The test fails with:
-Link count for file foo: 2
+Link count for file foo: 1
Link count for file bar: 2
+rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo_link': Stale file handle
+rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
(...)
_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
And fsck's output:
(...)
checking fs roots
root 5 inode 258 errors 2001, no inode item, link count wrong
unresolved ref dir 257 index 5 namelen 8 name foo_link filetype 1 errors 4, no inode ref
Checking filesystem on /dev/sdc
(...)
So fix this by marking inodes for link count fixup at log replay time
whenever a directory entry is replayed if the entry was created in the
transaction where the fsync was made and if it points to a non-directory
inode.
This isn't a new problem/regression, the issue exists for a long time,
possibly since the log tree feature was added (2008).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from mainline commit bb53eda9029fd52b466fa501ba4aa58e94789b18) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
This fixes a regression introduced by 37b8d27d between v4.1 and v4.2.
When a snapshot is received, its received_uuid is set to the original
uuid of the subvolume. When that snapshot is then resent to a third
filesystem, its received_uuid is set to the second uuid
instead of the original one. The same was true for the parent_uuid.
This behaviour was partially changed in 37b8d27d, but in that patch
only the parent_uuid was taken from the real original,
not the uuid itself, causing the search for the parent to fail in
the case below.
This happens for example when trying to send a series of linked
snapshots (e.g. created by snapper) from the backup file system back
to the original one.
The following commands reproduce the issue in v4.2.1
(no error in 4.1.6)
# setup three test file systems
for i in 1 2 3; do
truncate -s 50M fs$i
mkfs.btrfs fs$i
mkdir $i
mount fs$i $i
done
echo "content" > 1/testfile
btrfs su snapshot -r 1/ 1/snap1
echo "changed content" > 1/testfile
btrfs su snapshot -r 1/ 1/snap2
Signed-off-by: Robin Ruede <rruede+git@gmail.com> Fixes: 37b8d27de5d0 ("Btrfs: use received_uuid of parent during send") Cc: stable@vger.kernel.org # v4.2+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Ed Tomlinson <edt@aei.ca>
(cherry picked from commit b96b1db039ebc584d03a9933b279e0d3e704c528) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Hans Westgaard Ry [Wed, 3 Feb 2016 08:26:57 +0000 (09:26 +0100)]
net:Add sysctl_max_skb_frags
Devices may have limits on the number of fragments in an skb they support.
The current codebase uses a constant as the maximum number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages,
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.
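A sketch of how a sender consults the new limit (tcp_sendmsg-style;
details abbreviated):
	/* start a new skb rather than exceed the per-device frag limit */
	if (skb_shinfo(skb)->nr_frags >= sysctl_max_skb_frags) {
		tcp_mark_push(tp, skb);
		goto new_segment;
	}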
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5f74f82ea34c0da80ea0b49192bb5ea06e063593)
David Jeffery [Fri, 28 Aug 2015 04:50:45 +0000 (14:50 +1000)]
xfs: return errors from partial I/O failures to files
There is an issue with xfs's error reporting in some cases of I/O partially
failing and partially succeeding. Calls like fsync() can report success even
though not all I/O was successful in partial-failure cases such as one disk of
a RAID0 array being offline.
The issue can occur when there is more than one bio per xfs_ioend struct.
Each call to xfs_end_bio() for a completing bio will write a value to
ioend->io_error. If a successful bio completes after any failed bio, no
error is reported due to it writing 0 over the error code set by any failed bio.
The I/O error information is now lost and when the ioend is completed
only success is reported back up the filesystem stack.
xfs_end_bio() should only set ioend->io_error in the case of BIO_UPTODATE
being clear. ioend->io_error is initialized to 0 at allocation so only needs
to be updated by a failed bio. Also check that ioend->io_error is 0 so that
the first error reported will be the error code returned.
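A sketch of the corrected completion logic in xfs_end_bio():
	/* record only the first failure; a later successful bio must
	 * not overwrite an error already noted on the ioend */
	if (!ioend->io_error && !test_bit(BIO_UPTODATE, &bio->bi_flags))
		ioend->io_error = error;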
Cc: stable@vger.kernel.org Signed-off-by: David Jeffery <djeffery@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
Orabug: 22681464
Mainline v4.3 commit c9eb256eda4420c06bb10f5e8fbdbe1a34bc98e0 Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>
Don't strip leading zeros from the crypto key ID when using it to construct
the struct key description as the signature in kernels up to and including
4.2 matched this aspect of the key. This means that 1 in 256 keys won't
actually match if their key ID begins with 00.
The key ID is stored in the module signature as binary and so must be
converted to text in order to invoke request_key() - but it isn't stripped
at this point.
Something like this is likely to be observed in dmesg when the key is loaded:
The 'Loaded' line should show an extra '00' on the front of the hex string.
This problem should not affect 4.3-rc1 and onwards because there the key
should be matched on one of its auxiliary identities rather than the key
struct's description string.
Reported-by: Arjan van de Ven <arjan@linux.intel.com> Reported-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from mainline commit e7c87bef7de2417b219d4dbfe8d33a0098a8df54) Signed-off-by: Dan Duval <dan.duval@oracle.com>