Subject : [PATCH v2] bcache: fix a livelock in btree lock
Date : Wed, 25 Feb 2015 20:32:09 +0800 (02/25/2015 04:32:09 AM)
This commit tries to fix a livelock in bcache. This livelock might
happen when we causes a huge number of cache misses simultaneously.
When we get a cache miss, bcache will execute the following path.
->cached_dev_make_request()
->cached_dev_read()
->cached_lookup()
->bch->btree_map_keys()
->btree_root() <------------------------
->bch_btree_map_keys_recurse() |
->cache_lookup_fn() |
->cached_dev_cache_miss() |
->bch_btree_insert_check_key() -|
[If btree->seq is not equal to seq + 1, we should return
EINTR and traverse btree again.]
In bch_btree_insert_check_key() function we first need to check upgrade
flag (op->lock == -1), and when this flag is true we need to release
read btree->lock and try to take write btree->lock. During taking and
releasing this write lock, btree->seq will be monotone increased in
order to prevent other threads modify this in cache miss (see btree.h:74).
But if there are some cache misses caused by some requested, we could
meet a livelock because btree->seq is always changed by others. Thus no
one can make progress.
This commit will try to take write btree->lock if it encounters a race
when we traverse btree. Although it sacrifice the scalability but we
can ensure that only one can modify the btree.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Tested-by: Joshua Schmid <jschmid@suse.com> Tested-by: Eric Wheeler <bcache@linux.ewheeler.net> Cc: Joshua Schmid <jschmid@suse.com> Cc: Zhu Yanhai <zhu.yanhai@gmail.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit c505638900d191ae10d1f8afdc8d09ca8c2de0d4)
If a NFSv4 client uses the cache_consistency_bitmask in order to
request only information about the change attribute, timestamps and
size, then it has not revalidated all attributes, and hence the
attribute timeout timestamp should not be updated.
Reported-by: Donald Buczek <buczek@molgen.mpg.de> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 8572f2fa313c47750609178b20caf0598fd1042a)
We've seen this in a packet capture - I've intermixed what I
think was going on. The fix here is to grab the so_lock sooner.
1964379 -> #1 open (for write) reply seqid=1 1964393 -> #2 open (for read) reply seqid=2
__nfs4_close(), state->n_wronly--
nfs4_state_set_mode_locked(), changes state->state = [R]
state->flags is [RW]
state->state is [R], state->n_wronly == 0, state->n_rdonly == 1
1964398 -> #3 open (for write) call -> because close is already running 1964399 -> downgrade (to read) call seqid=2 (close of #1) 1964402 -> #3 open (for write) reply seqid=3
__update_open_stateid()
nfs_set_open_stateid_locked(), changes state->flags
state->flags is [RW]
state->state is [R], state->n_wronly == 0, state->n_rdonly == 1
new sequence number is exposed now via nfs4_stateid_copy()
next step would be update_open_stateflags(), pending so_lock
1964403 -> downgrade reply seqid=2, fails with OLD_STATEID (close of #1)
nfs4_close_prepare() gets so_lock and recalcs flags -> send close
The driver allocates the spinlock but fails to initialize it correctly.
The kernel reports a BUG indicating bad spinlock magic when spinlock
debugging is enabled.
Call spin_lock_init() on it to initialize it correctly.
udf_next_aext() just follows extent pointers while extents are marked as
indirect. This can loop forever for corrupted filesystem. Limit number
the of indirect extents we are willing to follow in a row.
[JK: Updated changelog, limit, style]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: stable@vger.kernel.org Cc: Jan Kara <jack@suse.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 4476dc677064d27c1fb6ff28a47eeac384e8ba64)
If the proxy lock in the requeue loop acquires the rtmutex for a
waiter then it acquired also refcount on the pi_state related to the
futex, but the waiter side does not drop the reference count.
Add the missing free_pi_state() call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <darren@dvhart.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Bhuvanesh_Surachari@mentor.com Cc: Andy Lowe <Andy_Lowe@mentor.com> Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit e70aade221a271f91e2d71901b2d602df2faee15)
When a thin pool is being destroyed delayed work items are
cancelled using cancel_delayed_work(), which doesn't guarantee that on
return the delayed item isn't running. This can cause the work item to
requeue itself on an already destroyed workqueue. Fix this by using
cancel_delayed_work_sync() which guarantees that on return the work item
is not running anymore.
Fixes: 905e51b39a555 ("dm thin: commit outstanding data every second") Fixes: 85ad643b7e7e5 ("dm thin: add timeout to stop out-of-data-space mode holding IO forever") Signed-off-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 9e612a000cf889cfa624d984ccef770fdce43cf6)
EDAC workqueue destruction is really fragile. We cancel delayed work
but if it is still running and requeues itself, we still go ahead and
destroy the workqueue and the queued work explodes when workqueue core
attempts to run it.
Make the destruction more robust by switching op_state to offline so
that requeuing stops. Cancel any pending work *synchronously* too.
I get the splat below when modprobing/rmmoding EDAC drivers. It happens
because bus->name is invalid after bus_unregister() has run. The Code: section
below corresponds to:
This patch fixes kernel crash at removing directory which contains
whiteouts from lower layers.
Cache of directory content passed as "list" contains entries from all
layers, including whiteouts from lower layers. So, lookup in upper dir
(moved into work at this stage) will return negative entry. Plus this
cache is filled long before and we can race with external removal.
Example:
mkdir -p lower0/dir lower1/dir upper work overlay
touch lower0/dir/a lower0/dir/b
mknod lower1/dir/a c 0 0
mount -t overlay none overlay -o lowerdir=lower1:lower0,upperdir=upper,workdir=work
rm -fr overlay/dir
The maximum chunks used by the function is
(SPI_AGGR_BUFFER_SIZE / WSPI_MAX_CHUNK_SIZE + 1).
The original commands array had space for
(SPI_AGGR_BUFFER_SIZE / WSPI_MAX_CHUNK_SIZE) commands.
When the last chunk is used (len > 4 * WSPI_MAX_CHUNK_SIZE), the last
command is stored outside the bounds of the commands array.
Oops 5 (page fault) is generated during current wl1271 firmware load
attempt:
root@debian-armhf:~# ifconfig wlan0 up
[ 294.312399] Unable to handle kernel paging request at virtual address 00203fc4
[ 294.320173] pgd = de528000
[ 294.323028] [00203fc4] *pgd=00000000
[ 294.326916] Internal error: Oops: 5 [#1] SMP ARM
[ 294.331789] Modules linked in: bnep rfcomm bluetooth ipv6 arc4 wl12xx
wlcore mac80211 musb_dsps cfg80211 musb_hdrc usbcore usb_common
wlcore_spi omap_rng rng_core musb_am335x omap_wdt cpufreq_dt thermal_sys
hwmon
[ 294.351838] CPU: 0 PID: 1827 Comm: ifconfig Not tainted 4.2.0-00002-g3e9ad27-dirty #78
[ 294.360154] Hardware name: Generic AM33XX (Flattened Device Tree)
[ 294.366557] task: dc9d6d40 ti: de550000 task.ti: de550000
[ 294.372236] PC is at __spi_validate+0xa8/0x2ac
[ 294.376902] LR is at __spi_sync+0x78/0x210
[ 294.381200] pc : [<c049c760>] lr : [<c049ebe0>] psr: 60000013
[ 294.381200] sp : de551998 ip : de5519d8 fp : 00200000
[ 294.393242] r10: de551c8c r9 : de5519d8 r8 : de3a9000
[ 294.398730] r7 : de3a9258 r6 : de3a9400 r5 : de551a48 r4 : 00203fbc
[ 294.405577] r3 : 00000000 r2 : 00000000 r1 : 00000000 r0 : de3a9000
[ 294.412420] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM
Segment user
[ 294.419918] Control: 10c5387d Table: 9e528019 DAC: 00000015
[ 294.425954] Process ifconfig (pid: 1827, stack limit = 0xde550218)
[ 294.432437] Stack: (0xde551998 to 0xde552000)
...
[ 294.883613] [<c049c760>] (__spi_validate) from [<c049ebe0>]
(__spi_sync+0x78/0x210)
[ 294.891670] [<c049ebe0>] (__spi_sync) from [<bf036598>]
(wl12xx_spi_raw_write+0xfc/0x148 [wlcore_spi])
[ 294.901661] [<bf036598>] (wl12xx_spi_raw_write [wlcore_spi]) from
[<bf21c694>] (wlcore_boot_upload_firmware+0x1ec/0x458 [wlcore])
[ 294.914038] [<bf21c694>] (wlcore_boot_upload_firmware [wlcore]) from
[<bf24532c>] (wl12xx_boot+0xc10/0xfac [wl12xx])
[ 294.925161] [<bf24532c>] (wl12xx_boot [wl12xx]) from [<bf20d5cc>]
(wl1271_op_add_interface+0x5b0/0x910 [wlcore])
[ 294.936364] [<bf20d5cc>] (wl1271_op_add_interface [wlcore]) from
[<bf15c4ac>] (ieee80211_do_open+0x44c/0xf7c [mac80211])
[ 294.947963] [<bf15c4ac>] (ieee80211_do_open [mac80211]) from
[<c0537978>] (__dev_open+0xa8/0x110)
[ 294.957307] [<c0537978>] (__dev_open) from [<c0537bf8>]
(__dev_change_flags+0x88/0x148)
[ 294.965713] [<c0537bf8>] (__dev_change_flags) from [<c0537cd0>]
(dev_change_flags+0x18/0x48)
[ 294.974576] [<c0537cd0>] (dev_change_flags) from [<c05a55a0>]
(devinet_ioctl+0x6b4/0x7d0)
[ 294.983191] [<c05a55a0>] (devinet_ioctl) from [<c0517040>]
(sock_ioctl+0x1e4/0x2bc)
[ 294.991244] [<c0517040>] (sock_ioctl) from [<c017d378>]
(do_vfs_ioctl+0x420/0x6b0)
[ 294.999208] [<c017d378>] (do_vfs_ioctl) from [<c017d674>]
(SyS_ioctl+0x6c/0x7c)
[ 295.006880] [<c017d674>] (SyS_ioctl) from [<c000f4c0>]
(ret_fast_syscall+0x0/0x54)
[ 295.014835] Code: e1550004e24440340a00007de5953018 (e5942008)
[ 295.021544] ---[ end trace 66ed188198f4e24e ]---
Signed-off-by: Uri Mashiach <uri.mashiach@compulab.co.il> Acked-by: Igor Grinberg <grinberg@compulab.co.il> Cc: stable@vger.kernel.org Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 3d51d36e7148182e76fc36d66c494d192c410fdd)
Free skb for received frames with a wrong checksum. This can happen
pretty rapidly, exhausting all memory.
This fixes a memleak (detected with kmemleak). Originally found while
using monitor mode, but it also appears during managed mode (once the
link is up).
Cc: stable@vger.kernel.org Signed-off-by: Peter Wu <peter@lekensteyn.nl> ACKed-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 76e8046ae4f3c7abd9e4f8ab69000609e554eb77)
1e75fa8 "time: Condense timekeeper.xtime into xtime_sec" replaced a call to
clocksource_cyc2ns() from timekeeping_get_ns() with an open-coded version
of the same logic to avoid keeping a semi-redundant struct timespec
in struct timekeeper.
However, the commit also introduced a subtle semantic change - where
clocksource_cyc2ns() uses purely unsigned math, the new version introduces
a signed temporary, meaning that if (delta * tk->mult) has a 63-bit
overflow the following shift will still give a negative result. The
choice of 'maxsec' in __clocksource_updatefreq_scale() means this will
generally happen if there's a ~10 minute pause in examining the
clocksource.
This can be triggered on a powerpc KVM guest by stopping it from qemu for
a bit over 10 minutes. After resuming time has jumped backwards several
minutes causing numerous problems (jiffies does not advance, msleep()s can
be extended by minutes..). It doesn't happen on x86 KVM guests, because
the guest TSC is effectively frozen while the guest is stopped, which is
not the case for the powerpc timebase.
Obviously an unsigned (64 bit) overflow will only take twice as long as a
signed, 63-bit overflow. I don't know the time code well enough to know
if that will still cause incorrect calculations, or if a 64-bit overflow
is avoided elsewhere.
Still, an incorrect forwards clock adjustment will cause less trouble than
time going backwards. So, this patch removes the potential for
intermediate signed overflow.
Cc: stable@vger.kernel.org (3.7+) Suggested-by: Laurent Vivier <lvivier@redhat.com> Tested-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 51975a203210d3a94a7711651b1045ed0e3514ed)
v4l2-compliance sends a zeroed struct v4l2_streamparm in
v4l2-test-formats.cpp::testParmType(), and this results in a division by
0 in some gspca subdrivers:
Sure, it's better to bail out of past-the-eof read and return 0 than return
a bogus negative value on such. Only we'd better make sure we are bailing out
with 0 and not -ENOMEM...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 57b7d61c2d89e3e0665b9cdcc8fb73b08d10f0c2)
Assume a filesystem with 4KB blocks. When a file has size 1000 bytes and
we issue direct IO read at offset 1024, blockdev_direct_IO() reads the
tail of the last block and the logic for handling short DIO reads in
dio_complete() results in a return value -24 (1000 - 1024) which
obviously confuses userspace.
Fix the problem by bailing out early once we sample i_size and can
reliably check that direct IO read starts beyond i_size.
Reported-by: Avi Kivity <avi@scylladb.com> Fixes: 9fe55eea7e4b444bafc42fa0000cc2d1d2847275 CC: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8885e7f3d76a9d42eb4fd97a3cfc5a8ac7f6f2ff)
The vivid_fb_ioctl() code fails to initialize the 16 _reserved bytes of
struct fb_vblank after the ->hcount member. Add an explicit
memset(0) before filling the structure to avoid the info leak.
hid_connect adds various strings to the buffer but they're all
conditional. You can find circumstances where nothing would be written
to it but the kernel will still print the supposedly empty buffer with
printk. This leads to corruption on the console/in the logs.
Ensure buf is initialized to an empty string.
Signed-off-by: Richard Purdie <richard.purdie@linuxfoundation.org>
[dvhart: Initialize string to "" rather than assign buf[0] = NULL;] Cc: Jiri Kosina <jikos@kernel.org> Cc: linux-input@vger.kernel.org Signed-off-by: Darren Hart <dvhart@linux.intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b50a2b556d1e487c2430e13042a98325adca0be5)
Remove the dst_entries_init/destroy calls for xfrm4 and xfrm6 dst_ops
templates; their dst_entries counters will never be used. Move the
xfrm dst_ops initialization from the common xfrm/xfrm_policy.c to
xfrm4/xfrm4_policy.c and xfrm6/xfrm6_policy.c, and call dst_entries_init
and dst_entries_destroy for each net namespace.
The ipv4 and ipv6 xfrms each create dst_ops template, and perform
dst_entries_init on the templates. The template values are copied to each
net namespace's xfrm.xfrm*_dst_ops. The problem there is the dst_ops
pcpuc_entries field is a percpu counter and cannot be used correctly by
simply copying it to another object.
The result of this is a very subtle bug; changes to the dst entries
counter from one net namespace may sometimes get applied to a different
net namespace dst entries counter. This is because of how the percpu
counter works; it has a main count field as well as a pointer to the
percpu variables. Each net namespace maintains its own main count
variable, but all point to one set of percpu variables. When any net
namespace happens to change one of the percpu variables to outside its
small batch range, its count is moved to the net namespace's main count
variable. So with multiple net namespaces operating concurrently, the
dst_ops entries counter can stray from the actual value that it should
be; if counts are consistently moved from one net namespace to another
(which my testing showed is likely), then one net namespace winds up
with a negative dst_ops count while another winds up with a continually
increasing count, eventually reaching its gc_thresh limit, which causes
all new traffic on the net namespace to fail with -ENOBUFS.
Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit eeca98948d8c4922e6deb16bfc9ee0bd9902dbb0)
A case can occur when sctp_accept() is called by the user during
a heartbeat timeout event after the 4-way handshake. Since
sctp_assoc_migrate() changes both assoc->base.sk and assoc->ep, the
bh_sock_lock in sctp_generate_heartbeat_event() will be taken with
the listening socket but released with the new association socket.
The result is a deadlock on any future attempts to take the listening
socket lock.
Note that this race can occur with other SCTP timeouts that take
the bh_lock_sock() in the event sctp_accept() is called.
=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
CslRx/12087 is trying to release lock (slock-AF_INET) at:
[<ffffffffa01bcae0>] sctp_generate_timeout_event+0x40/0xe0 [sctp]
but there are no more locks to release!
other info that might help us debug this:
2 locks held by CslRx/12087:
#0: (&asoc->timers[i]){+.-...}, at: [<ffffffff8108ce1f>] run_timer_softirq+0x16f/0x3e0
#1: (slock-AF_INET){+.-...}, at: [<ffffffffa01bcac3>] sctp_generate_timeout_event+0x23/0xe0 [sctp]
Ensure the socket taken is also the same one that is released by
saving a copy of the socket before entering the timeout event
critical section.
Signed-off-by: Karl Heiss <kheiss@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 534e9016cd88ccd577b226b7172e5cd079f5fb02)
We can't be within an RCU read-side critical section when deleting
VLANs, as underlying drivers might sleep during the hardware operation.
Therefore, replace the RCU critical section with a mutex. This is
consistent with team_vlan_rx_add_vid.
Fixes: 3d249d4ca7d0 ("net: introduce ethernet teaming device") Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 490c963c1eb9a7f5f74446f4b9738c9fb1b8e325)
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_orig_node_free_ref.
Fixes: 72822225bd41 ("batman-adv: Fix rcu_barrier() miss due to double call_rcu() in TT code") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 95785b105fa2f784ad7b6be0238338b648ed892a)
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_hardif_free_ref.
Fixes: 89652331c00f ("batman-adv: split tq information in neigh_node struct") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ae3eb44e0e8b4e0fe4ab51e0a91d76517abc1e30)
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_neigh_ifinfo_free_ref.
Fixes: 89652331c00f ("batman-adv: split tq information in neigh_node struct") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 924224c6e44a393a2a41385d699230384ea23df5)
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_neigh_node_free_ref.
Fixes: 89652331c00f ("batman-adv: split tq information in neigh_node struct") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 620493a90c78ce1bd150d48ca6f624ad964e0c8d)
It is not allowed to free the memory of an object which is part of a list
which is protected by rcu-read-side-critical sections without making sure
that no other context is accessing the object anymore. This usually happens
by removing the references to this object and then waiting until the rcu
grace period is over and no one (allowedly) accesses it anymore.
But the _now functions ignore this completely. They free the object
directly even when a different context still tries to access it. This has
to be avoided and thus these functions must be removed and all functions
have to use batadv_orig_ifinfo_free_ref.
Fixes: 7351a4822d42 ("batman-adv: split out router from orig_node") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9d188c6b672c7d205f43da74b534d2114371edc5)
The batadv_nc_node_free_ref function uses call_rcu to delay the free of the
batadv_nc_node object until no (already started) rcu_read_lock is enabled
anymore. This makes sure that no context is still trying to access the
object which should be removed. But batadv_nc_node also contains a
reference to orig_node which must be removed.
The reference drop of orig_node was done in the call_rcu function
batadv_nc_node_free_rcu but should actually be done in the
batadv_nc_node_release function to avoid nested call_rcus. This is
important because rcu_barrier (e.g. batadv_softif_free or batadv_exit) will
not detect the inner call_rcu as relevant for its execution. Otherwise this
barrier will most likely be inserted in the queue before the callback of
the first call_rcu was executed. The caller of rcu_barrier will therefore
continue to run before the inner call_rcu callback finished.
Fixes: d56b1705e28c ("batman-adv: network coding - detect coding nodes and remove these after timeout") Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 34c5bf7c7bf4e285274e28a36b08cbf3da5bd3e3)
The batadv_claim_free_ref function uses call_rcu to delay the free of the
batadv_bla_claim object until no (already started) rcu_read_lock is enabled
anymore. This makes sure that no context is still trying to access the
object which should be removed. But batadv_bla_claim also contains a
reference to backbone_gw which must be removed.
The reference drop of backbone_gw was done in the call_rcu function
batadv_claim_free_rcu but should actually be done in the
batadv_claim_release function to avoid nested call_rcus. This is important
because rcu_barrier (e.g. batadv_softif_free or batadv_exit) will not
detect the inner call_rcu as relevant for its execution. Otherwise this
barrier will most likely be inserted in the queue before the callback of
the first call_rcu was executed. The caller of rcu_barrier will therefore
continue to run before the inner call_rcu callback finished.
Fixes: 23721387c409 ("batman-adv: add basic bridge loop avoidance code") Signed-off-by: Sven Eckelmann <sven@narfation.org> Acked-by: Simon Wunderlich <sw@simonwunderlich.de> Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch> Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 016cb1d02db9840a62c0e2a465ffd6617eacffc7)
Currently slhc_init() treats out-of-range values of rslots and tslots
as equivalent to 0, except that if tslots is too large it will
dereference a null pointer (CVE-2015-7799).
Add a range-check at the top of the function and make it return an
ERR_PTR() on error instead of NULL. Change the callers accordingly.
Compile-tested only.
Reported-by: 郭永刚 <guoyonggang@360.cn>
References: http://article.gmane.org/gmane.comp.security.oss.general/17908 Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a50a93cc99286dc444c7e5ccc7dfb9d58c2d346d)
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5984398539a2c47834caf1b00dc9f58b7bb2e67a)
When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.
As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.
This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.
Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.
To make things more readable I introduced inet_twsk_reschedule() helper.
When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.
Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.
A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.
Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Ying Cai <ycai@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 479b539a310152b10c1747ac8c5354ed7bc25181)
After promisc mode management was introduced a bridge device could do
dev_set_promiscuity from its ndo_change_rx_flags() callback which in
turn can be called after the bridge's addr_list_lock has been taken
(e.g. by dev_uc_add). This causes a false positive lockdep splat because
the port interfaces' addr_list_lock is taken when br_manage_promisc()
runs after the bridge's addr list lock was already taken.
To remove the false positive introduce a custom bridge addr_list_lock
class and set it on bridge init.
A simple way to reproduce this is with the following:
$ brctl addbr br0
$ ip l add l br0 br0.100 type vlan id 100
$ ip l set br0 up
$ ip l set br0.100 up
$ echo 1 > /sys/class/net/br0/bridge/vlan_filtering
$ brctl addif br0 eth0
Splat:
[ 43.684325] =============================================
[ 43.684485] [ INFO: possible recursive locking detected ]
[ 43.684636] 4.4.0-rc8+ #54 Not tainted
[ 43.684755] ---------------------------------------------
[ 43.684906] brctl/1187 is trying to acquire lock:
[ 43.685047] (_xmit_ETHER){+.....}, at: [<ffffffff8150169e>] dev_set_rx_mode+0x1e/0x40
[ 43.685460] but task is already holding lock:
[ 43.685618] (_xmit_ETHER){+.....}, at: [<ffffffff815072a7>] dev_uc_add+0x27/0x80
[ 43.686015] other info that might help us debug this:
[ 43.686316] Possible unsafe locking scenario:
CC: Vlad Yasevich <vyasevic@redhat.com> CC: Stephen Hemminger <stephen@networkplumber.org> CC: Bridge list <bridge@lists.linux-foundation.org> CC: Andy Gospodarek <gospo@cumulusnetworks.com> CC: Roopa Prabhu <roopa@cumulusnetworks.com> Fixes: 2796d0c648c9 ("bridge: Automatically manage port promiscuous mode.") Reported-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 332fb8799ed1c701339e9b4dad3efbed245b4cf8)
When a tunnel decapsulates the outer header, it has to comply
with RFC 6080 and eventually propagate CE mark into inner header.
It turns out IP6_ECN_set_ce() does not correctly update skb->csum
for CHECKSUM_COMPLETE packets, triggering infamous "hw csum failure"
messages and stack traces.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2980502b9f6d3e15a4dd6695d4cfff8ec3213209)
On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT. Since these shifts
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.
Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit dc1cfcc266978d0213d7d4f9f813bafd4e9f709a)
Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.
This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.
Commit acf8dd0a9d0b ("udp: only allow UFO for packets from SOCK_DGRAM
sockets") disallows UFO for packets sent from raw sockets. We need to do
the same also for SOCK_DGRAM sockets with SO_NO_CHECK options, even if
for a bit different reason: while such socket would override the
CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
bad offloading flags warning is triggered in __skb_gso_segment().
In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Tested-by: Shannon Nelson <shannon.nelson@intel.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4bb526bce19d6e4c8cc862ce304b8c072ef43366)
For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
and CUBIC, per RFC 5681 (equation 4).
tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh
value if the intended reduction is as big or bigger than the current
cwnd. Congestion control modules should never return a zero or
negative ssthresh. A zero ssthresh generally results in a zero cwnd,
causing the connection to stall. A negative ssthresh value will be
interpreted as a u32 and will set a target cwnd for PRR near 4
billion.
Oleksandr Natalenko reported that a system using tcp_yeah with ECN
could see a warning about a prior_cwnd of 0 in
tcp_cwnd_reduction(). Testing verified that this was due to
tcp_yeah_ssthresh() misbehaving in this way.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e7bbeacbafc18a7781036c6111ffd9fe9eaa6ec0)
When first SYNACK is sent, we already hold rcu_read_lock(), but this
is not true if a SYNACK is retransmitted, as a timer (soft) interrupt
does not hold rcu_read_lock()
Fixes: 45f6fad84cc30 ("ipv6: add complete rcu protection around np->opt") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit add92082e2d14367b27b0e18b0deeaedd7c1f938)
proc_dostring() needs an initialized destination string, while the one
provided in proc_sctp_do_hmac_alg() contains stack garbage.
Thus, writing to cookie_hmac_alg would strlen() that garbage and end up
accessing invalid memory.
Fixes: 3c68198e7 ("sctp: Make hmac algorithm selection for cookie generation dynamic") Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit bcea43fb3164cdcdc42c710ca0cb90aa9c6a5530)
When a vxlan interface is created, the driver checks that there is not
another vxlan interface with the same properties. To do this, it checks
the existing vxlan udp socket. Since commit 1c51a9159dde, the creation of
the vxlan socket is done only when the interface is set up, thus it breaks
that test.
Example:
$ ip l a vxlan10 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
$ ip l a vxlan11 type vxlan id 10 group 239.0.0.10 dev eth0 dstport 0
$ ip -br l | grep vxlan
vxlan10 DOWN f2:55:1c:6a:fb:00 <BROADCAST,MULTICAST>
vxlan11 DOWN 7a:cb:b9:38:59:0d <BROADCAST,MULTICAST>
Instead of checking sockets, let's loop over the vxlan iface list.
Fixes: 1c51a9159dde ("vxlan: fix race caused by dropping rtnl_unlock") Reported-by: Thomas Faivre <thomas.faivre@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a33704eb60850ac837ef83ecc380f5f6b2aa2906)
dst_release should not access dst->flags after decrementing
__refcnt to 0. The dst_entry may be in dst_busy_list and
dst_gc_task may dst_destroy it before dst_release gets a chance
to access dst->flags.
Fixes: d69bbf88c8d0 ("net: fix a race in dst_release()") Fixes: 27b75c95f10d ("net: avoid RCU for NOCACHE dst") Signed-off-by: Francesco Ruggeri <fruggeri@arista.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8f5cd6eea8115c6a4153268508369b81f84e9dce)
When a qdisc is using per cpu stats (currently just the ingress
qdisc) only the bstats are being freed. This also free's the qstats.
Fixes: b0ab6f92752b9f9d8 ("net: sched: enable per cpu qstats") Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 02a1fef61d06bb0417b60f1233320e141266860c)
The SKF_AD_ALU_XOR_X ancillary is not like the other ancillary data
instructions since it XORs A with X while all the others replace A with
some loaded value. All the BPF JITs fail to clear A if this is used as
the first instruction in a filter. This was found using american fuzzy
lop.
Add a helper to determine if A needs to be cleared given the first
instruction in a filter, and use this in the JITs. Except for ARM, the
rest have only been compile-tested.
Fixes: 3480593131e0 ("net: filter: get rid of BPF_S_* enum") Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5596242a6263ece70ee14f3b6861f02b8dc82d11)
[I stole this patch from Eric Biederman. He wrote:]
> There is no defined mechanism to pass network namespace information
> into /sbin/bridge-stp therefore don't even try to invoke it except
> for bridge devices in the initial network namespace.
>
> It is possible for unprivileged users to cause /sbin/bridge-stp to be
> invoked for any network device name which if /sbin/bridge-stp does not
> guard against unreasonable arguments or being invoked twice on the
> same network device could cause problems.
[Hannes: changed patch using netns_eq]
Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a6b1d24893805bb4f2e7c7bc6f86a98ec37065ee)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit 67af13e958a0de5dc4191d31e96dcf4240ea96d2, "unix: properly account for FDs passed
over unix sockets", added an element "unix_inflight" to struct user_struct. This broke
kABI. Even worse, the element was added in the middle of the structure, changing the
offsets of all the other existing elements that ended up after "unix_inflight".
This commit wraps the new element in "#ifndef __GENKSYMS__" so that the kABI checker
won't see it. It also moves the element to the end of the structure, so that existing
binary third-party modules won't be affected if they simply access existing elements
via pointers. A search of the kernel source hints that this structure won't be accessed
via the expected kinds of third-party modules (e.g., drivers) at all, so this change
should be fairly safe
(cherry picked from commit b3f5d5cd4fa654c63389b683976f2df7b25b6ceb)
It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.
This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.
Reported-by: socketpair@gmail.com Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Mitigates: CVE-2013-4312 (Linux 2.0+) Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit dc6b0ec667f67d4768e72c1b7f1bbc14ea52379c)
In sctp_close, sctp_make_abort_user may return NULL because of memory
allocation failure. If this happens, it will bypass any state change
and never free the assoc. The assoc has no chance to be freed and it
will be kept in memory with the state it had even after the socket is
closed by sctp_close().
So if sctp_make_abort_user fails to allocate memory, we should abort
the asoc via sctp_primitive_ABORT as well. Just like the annotation in
sctp_sf_cookie_wait_prm_abort and sctp_sf_do_9_1_prm_abort said,
"Even if we can't send the ABORT due to low memory delete the TCB.
This is a departure from our typical NOMEM handling".
But then the chunk is NULL (low memory) and the SCTP_CMD_REPLY cmd would
dereference the chunk pointer, and system crash. So we should add
SCTP_CMD_REPLY cmd only when the chunk is not NULL, just like other
places where it adds SCTP_CMD_REPLY cmd.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit c000d76ec0d8bbadb7972b374bcf7ffe34deb592)
When sysctl performs restrict writes, it allows to write from
a middle position of a sysctl file, which requires us to initialize
the table data before calling proc_dostring() for the write case.
Fixes: 3d1bec99320d ("ipv6: introduce secret_stable to ipv6_devconf") Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6866b52c56db189f8471a13f951221c45c0089c7)
ip6addrlbl_get() has never worked. If ip6addrlbl_hold() succeeded,
ip6addrlbl_get() will exit with '-ESRCH'. If ip6addrlbl_hold() failed,
ip6addrlbl_get() will use about to be free ip6addrlbl_entry pointer.
Fix this by inverting ip6addrlbl_hold() check.
Fixes: 2a8cc6c89039 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Cong Wang <cwang@twopensource.com> Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8a39e24931d119c0418721923281d2b496275ad3)
Attempting to use this codec driver triggers a BUG() in regcache_sync()
since no cache type is set. The register map of this device is fairly
small and has few holes so a flat cache is suitable.
Signed-off-by: Mans Rullgard <mans@mansr.com> Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 1fce19176555fdad16a9933965f36e514514d092)
Currently ALSA timer device doesn't take the disconnection into
account very well; it merely unlinks the timer device at disconnection
callback but does nothing else. Because of this, when an application
accessing the timer device is disconnected, it may release the
resource before actually closed. In most cases, it results in a
warning message indicating a leftover timer instance like:
ALSA: timer xxxx is busy?
But basically this is an open race.
This patch tries to address it. The strategy is like other ALSA
devices: namely,
- Manage card's refcount at each open/close
- Wake up the pending tasks at disconnection
- Check the shutdown flag appropriately at each possible call
Note that this patch has one ugly hack to handle the wakeup of pending
tasks. It'd be cleaner to introduce a new disconnect op to
snd_timer_instance ops. But since it would lead to internal ABI
breakage and it eventually increase my own work when backporting to
stable kernels, I took a different path to implement locally in
timer.c. A cleanup patch will follow at next for 4.5 kernel.
As HD-audio driver does deferred probe internally via workqueue, the
driver might go into the mixed state doing both probe and remove when
the module gets unloaded during the probe work. This eventually
triggers an Oops, unsurprisingly.
For avoiding this race, we just need to flush the pending probe work
explicitly before actually starting the resource release.
When a TLV ioctl with numid zero is handled, the driver may spew a
kernel warning with a stack trace at each call. The check was
intended obviously only for a kernel driver, but not for a user
interaction. Let's fix it.
hrtimer_cancel() waits for the completion from the callback, thus it
must not be called inside the callback itself. This was already a
problem in the past with ALSA hrtimer driver, and the early commit
[fcfdebe70759: ALSA: hrtimer - Fix lock-up] tried to address it.
However, the previous fix is still insufficient: it may still cause a
lockup when the ALSA timer instance reprograms itself in its callback.
Then it invokes the start function even in snd_timer_interrupt() that
is called in hrtimer callback itself, results in a CPU stall. This is
no hypothetical problem but actually triggered by syzkaller fuzzer.
This patch tries to fix the issue again. Now we call
hrtimer_try_to_cancel() at both start and stop functions so that it
won't fall into a deadlock, yet giving some chance to cancel the queue
if the functions have been called outside the callback. The proper
hrtimer_cancel() is called in anyway at closing, so this should be
enough.
This reverts one hunk of
commit ef44a1ec6eee ("ALSA: sound/core: use memdup_user()"), which
replaced a number of kmalloc followed by memcpy with memdup calls.
In this case, we are copying from a struct snd_pcm_hw_params32 to
a struct snd_pcm_hw_params, but the latter is 4 bytes longer than
the 32-bit version, so we need to separate kmalloc and copy calls.
This actually leads to an out-of-bounds memory access later on
in sound/soc/soc-pcm.c:soc_pcm_hw_params() (detected using KASan).
Fixes: ef44a1ec6eee ('ALSA: sound/core: use memdup_user()') Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 281bedb8728f02a9b8bd5f5e342883e6d255abdc)
This reverts one hunk of
commit ef44a1ec6eee ("ALSA: sound/core: use memdup_user()"), which
replaced a number of kmalloc followed by memcpy with memdup calls.
In this case, we are copying from a struct snd_seq_port_info32 to a
struct snd_seq_port_info, but the latter is 4 bytes longer than the
32-bit version, so we need to separate kmalloc and copy calls.
Fixes: ef44a1ec6eee ('ALSA: sound/core: use memdup_user()') Signed-off-by: Nicolas Boichat <drinkcat@chromium.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d1a55757fefff39a7b71305a78c2722d3b9b6529)
ALSA timer instance object has a couple of linked lists and they are
unlinked unconditionally at snd_timer_stop(). Meanwhile
snd_timer_interrupt() unlinks it, but it calls list_del() which leaves
the element list itself unchanged. This ends up with unlinking twice,
and it was caught by syzkaller fuzzer.
The fix is to use list_del_init() variant properly there, too.
ALSA timer ioctls have an open race and this may lead to a
use-after-free of timer instance object. A simplistic fix is to make
each ioctl exclusive. We have already tread_sem for controlling the
tread, and extend this as a global mutex to be applied to each ioctl.
The downside is, of course, the worse concurrency. But these ioctls
aren't to be parallel accessible, in anyway, so it should be fine to
serialize there.
A slave timer instance might be still accessible in a racy way while
operating the master instance as it lacks of locking. Since the
master operation is mostly protected with timer->lock, we should cope
with it while changing the slave instance, too. Also, some linked
lists (active_list and ack_list) of slave instances aren't unlinked
immediately at stopping or closing, and this may lead to unexpected
accesses.
This patch tries to address these issues. It adds spin lock of
timer->lock (either from master or slave, which is equivalent) in a
few places. For avoiding a deadlock, we ensure that the global
slave_active_lock is always locked at first before each timer lock.
Also, ack and active_list of slave instances are properly unlinked at
snd_timer_stop() and snd_timer_close().
Last but not least, remove the superfluous call of _snd_timer_stop()
at removing slave links. This is a noop, and calling it may confuse
readers wrt locking. Further cleanup will follow in a later patch.
Actually we've got reports of use-after-free by syzkaller fuzzer, and
this hopefully fixes these issues.
ALSA sequencer code has an open race between the timer setup ioctl and
the close of the client. This was triggered by syzkaller fuzzer, and
a use-after-free was caught there as a result.
This patch papers over it by adding a proper queue->timer_mutex lock
around the timer-related calls in the relevant code path.
snd_seq_ioctl_remove_events() calls snd_seq_fifo_clear()
unconditionally even if there is no FIFO assigned, and this leads to
an Oops due to NULL dereference. The fix is just to add a proper NULL
check.
We encountered a panic on boot in ipmi_si on a dell per320 due to an
uninitialized timer as follows.
static int smi_start_processing(void *send_info,
ipmi_smi_t intf)
{
/* Try to claim any interrupts. */
if (new_smi->irq_setup)
new_smi->irq_setup(new_smi);
--> IRQ arrives here and irq handler tries to modify uninitialized timer
which triggers BUG_ON(!timer->function) in __mod_timer().
When switch_mm() activates a new PGD, it also sets a bit that
tells other CPUs that the PGD is in use so that TLB flush IPIs
will be sent. In order for that to work correctly, the bit
needs to be visible prior to loading the PGD and therefore
starting to fill the local TLB.
Document all the barriers that make this work correctly and add
a couple that were missing.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit ae535caf02c7e2e7feec62f4e07ac1f48ad5b336)
When decompressing kernel image during x86 bootup, malloc memory
for ELF program headers may run out of heap space, which leads
to system halt. This patch doubles BOOT_HEAP_SIZE to 64KB.
Tested with 32-bit kernel which failed to boot without this patch.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4b6c3a55305de01d24d994d16ff7dc3dfaee2715)
When restarting a syscall with regs->ax == -ERESTART_RESTARTBLOCK,
regs->ax is assigned to a restart_syscall number. For x32 tasks, this
syscall number must have __X32_SYSCALL_BIT set, otherwise it will be
an x86_64 syscall number instead of a valid x32 syscall number. This
issue has been there since the introduction of x32.
Reported-by: strace/tests/restart_syscall.test Reported-and-tested-by: Elvira Khabirova <lineprinter0@gmail.com> Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Cc: Elvira Khabirova <lineprinter0@gmail.com> Link: http://lkml.kernel.org/r/20151130215436.GA25996@altlinux.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3ab6a090acfb20dfa49979d21c0a110019ad381f)
rds-stress experiments with request size 256 bytes, 8K acks,
using 16 threads show a 40% improvment when pskb_extract()
replaces the {skb_clone(..); pskb_pull(..); pskb_trim(..);}
pattern in the Rx path, so we leverage the perf gain with
this commit.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
There's a race between keyctl_read() and keyctl_revoke(). If the revoke
happens between keyctl_read() checking the validity of a key and the key's
semaphore being taken, then the key type read method will see a revoked key.
This causes a problem for the user-defined key type because it assumes in
its read method that there will always be a payload in a non-revoked key
and doesn't check for a NULL pointer.
Fix this by making keyctl_read() check the validity of a key after taking
semaphore instead of before.
I think the bug was introduced with the original keyrings code.
This was discovered by a multithreaded test program generated by syzkaller
(http://github.com/google/syzkaller). Here's a cleaned up version:
The commit ba7c95ea3870fe7b847466d39a049ab6f156aa2c ("rhashtable:
Fix sleeping inside RCU critical section in walk_stop") introduced
a new spinlock for the walker list. However, it did not convert
all existing users of the list over to the new spin lock. Some
continued to use the old mutext for this purpose. This obviously
led to corruption of the list.
The fix is to use the spin lock everywhere where we touch the list.
This also allows us to do rcu_rad_lock before we take the lock in
rhashtable_walk_start. With the old mutex this would've deadlocked
but it's safe with the new spin lock.
Fixes: ba7c95ea3870 ("rhashtable: Fix sleeping inside RCU...") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ac5966d165a9b4115744269a64ffd3d061b5efe2) Signed-off-by: Dan Duval <dan.duval@oracle.com>
fou->udp_offloads is managed by RCU. As it is actually included inside
the fou sockets, we cannot let the memory go out of scope before a grace
period. We either can synchronize_rcu or switch over to kfree_rcu to
manage the sockets. kfree_rcu seems appropriate as it is used by vxlan
and geneve.
Fixes: 23461551c00628c ("fou: Support for foo-over-udp RX path") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2bc1b5b3a634081d49dbb16aab2b55cec1eb55b9) Signed-off-by: Dan Duval <dan.duval@oracle.com>
William Hua <william.hua@canonical.com> wrote:
>
> I wasn't aware there was an enforced minimum size. I simply set the
> nelem_hint in the rhastable_params struct to 1, expecting it to grow as
> needed. This caused a segfault afterwards when trying to insert an
> element.
OK we're doing the size computation before we enforce the limit
on min_size.
---8<---
We need to do the initial hash table size computation after we
have obtained the correct min_size/max_size parameters. Otherwise
we may end up with a hash table whose size is outside the allowed
envelope.
Fixes: a998f712f77e ("rhashtable: Round up/down min/max_size to...") Reported-by: William Hua <william.hua@canonical.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 78e1970233fd270349834bb8b7ca84c58c7c974c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
msg_iocb needs to be initialized on the recv/recvfrom path.
Otherwise afalg will wrongly interpret it as an async call.
Cc: stable@vger.kernel.org Reported-by: Harald Freudenberger <freude@linux.vnet.ibm.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 1c72e110be7f6860ccb4e226beb34e2c4537f6bb) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 1814295365a02ff174948159f1eececa55fe4f97) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This is because we mistake a raw socket as a tcp socket.
We should check both sk->sk_type and sk->sk_protocol to ensure
it is a tcp socket.
Willem points out __skb_complete_tx_timestamp() needs to fix as well.
Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit bd30c2ee7e5dc8258476801a2ec7039a47e2078e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
skb_reorder_vlan_header is called after the vlan header has
been pulled. As a result the offset of the begining of
the mac header has been incrased by 4 bytes (VLAN_HLEN).
When moving the mac addresses, include this incrase in
the offset calcualation so that the mac addresses are
copied correctly.
Fixes: a6e18ff1117 (vlan: Fix untag operations of stacked vlans with REORDER_HEADER off) CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Patrick McHardy <kaber@trash.net> Signed-off-by: Vladislav Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 2af7df8919a74e4567878b1415f4dd5edbb75163)
Reported-by: Dmitry Vyukov <dvyukov@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit f167b6f4244fbc8d05fcc385b1bf8e70729c9e7c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
David Wilder reported crashes caused by dst reuse.
<quote David>
I am seeing a crash on a distro V4.2.3 kernel caused by a double
release of a dst_entry. In ipv4_dst_destroy() the call to
list_empty() finds a poisoned next pointer, indicating the dst_entry
has already been removed from the list and freed. The crash occurs
18 to 24 hours into a run of a network stress exerciser.
</quote>
Thanks to his detailed report and analysis, we were able to understand
the core issue.
IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.
When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.
Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.
We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.
This patch, being a stable candidate, adds two new helpers, and use
them only from IP early demux problematic paths.
It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.
Can probably be backported back to linux-3.6 kernels
Reported-by: David J. Wilder <dwilder@us.ibm.com> Tested-by: David J. Wilder <dwilder@us.ibm.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2dea635ac36ee66cae8a9c001694d0b72dd1d939) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This is because netdev_alloc_skb() fails and 'mdp->rx_skbuff[entry]' is left
NULL but sh_eth_rx() later uses it without checking. Add such check...
Reported-by: Yasushi SHOJI <yashi@atmark-techno.com> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8fa4abd57ad4ef9b2ff9beda2eb29368ae10ae9e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
AF_INET, AF_INET6 sockets actually only support 8-bit protocol
identifiers. inet_sock's skc_protocol field thus is sized accordingly,
thus larger protocol identifiers simply cut off the higher bits and
store a zero in the protocol fields.
This could lead to e.g. NULL function pointer because as a result of
the cut off inet_num is zero and we call down to inet_autobind, which
is NULL for raw sockets.
I found no particular commit which introduced this problem.
CVE: CVE-2015-8543 Cc: Cong Wang <cwang@twopensource.com> Reported-by: 郭永刚 <guoyonggang@360.cn> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit bc8f79b522b57ca79a676615003d85b08162ff5a) Signed-off-by: Dan Duval <dan.duval@oracle.com>
SCTP is lacking proper np->opt cloning at accept() time.
TCP and DCCP use ipv6_dup_options() helper, do the same
in SCTP.
We might later factorize this code in a common helper to avoid
future mistakes.
Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 683f40ec958e669b0d376a0d83c0213783ee7f99) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit 3511494ce2f3d ("vxlan: Group Policy extension") changed definition of
VXLAN_HF_RCO from 0x00200000 to BIT(24). This is obviously incorrect. It's
also in violation with the RFC draft.
Fixes: 3511494ce2f3d ("vxlan: Group Policy extension") Cc: Thomas Graf <tgraf@suug.ch> Cc: Tom Herbert <therbert@google.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d599ae203c46fa6de7af4564c3fef8015f3012ca) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit 64236f3f3d74 ("ipv6: introduce IFA_F_STABLE_PRIVACY flag")
failed to update the setting of the IFA_F_OPTIMISTIC flag, causing
the IFA_F_STABLE_PRIVACY flag to be lost if IFA_F_OPTIMISTIC is set.
Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Fixes: 64236f3f3d74 ("ipv6: introduce IFA_F_STABLE_PRIVACY flag") Signed-off-by: Bjørn Mork <bjorn@mork.no> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a7a97780df2c5d34f0ab7a9a748708dc5e98a087) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Reported-by: Andrew <nitr0@seti.kr.ua> Fixes: 287f3a943fef ("pppoe: Use workqueue to die properly when a PADT is received") Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2e7b09211aaadac86fe3ae8e0a94566bfb2ac83d) Signed-off-by: Dan Duval <dan.duval@oracle.com>
When a USB 3.0 mass storage device is disconnected in transporting
state, storage device driver may handle it as a transport error and
reset the device by invoking usb_reset_and_verify_device()
and following could happen:
in usb_reset_and_verify_device():
udev->bos = NULL;
For U1/U2 enabled devices, driver will disable LPM, and in some
conditions:
from usb_unlocked_disable_lpm()
--> usb_disable_lpm()
--> usb_enable_lpm()
udev->bos->ss_cap->bU1devExitLat;
And it causes 'NULL pointer' and 'kernel panic':
[ 157.976257] Unable to handle kernel NULL pointer dereference
at virtual address 00000010
...
[ 158.026400] PC is at usb_enable_link_state+0x34/0x2e0
[ 158.031442] LR is at usb_enable_lpm+0x98/0xac
...
[ 158.137368] [<ffffffc0006a1cac>] usb_enable_link_state+0x34/0x2e0
[ 158.143451] [<ffffffc0006a1fec>] usb_enable_lpm+0x94/0xac
[ 158.148840] [<ffffffc0006a20e8>] usb_disable_lpm+0xa8/0xb4
...
[ 158.214954] Kernel panic - not syncing: Fatal exception
This commit moves 'udev->bos = NULL' behind usb_unlocked_disable_lpm()
to prevent from NULL pointer access.
Issue can be reproduced by following setup:
1) A SS pen drive behind a SS hub connected to the host.
2) Transporting data between the pen drive and the host.
3) Abruptly disconnect hub and pen drive from host.
4) With a chance it crashes.
Signed-off-by: Hans Yang <hansy@nvidia.com> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit c922602c0d2ae1f435d16b6124cfa1137ce83a7e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If clp->cl_cb_ident is zero, then nfs_cb_idr_remove_locked() skips removing
it when the nfs_client is freed. A decoding or server bug can then find
and try to put that first nfs_client which would lead to a crash.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: d6870312659d ("nfs4client: convert to idr_alloc()") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 254cbeb139be712a9dbcfcaea997a3e3dfd9be52) Signed-off-by: Dan Duval <dan.duval@oracle.com>
We've observed the nfsd server in a state where there are
multiple delegations on the same nfs4_file for the same client.
The nfs client does attempt to DELEGRETURN these when they are presented to
it - but apparently under some (unknown) circumstances the client does not
manage to return all of them. This leads to the eventual
attempt to CB_RECALL more than one delegation with the same nfs
filehandle to the same client. The first recall will succeed, but the
next recall will fail with NFS4ERR_BADHANDLE. This leads to the server
having delegations on cl_revoked that the client has no way to FREE
or DELEGRETURN, with resulting inability to recover. The state manager
on the server will continually assert SEQ4_STATUS_RECALLABLE_STATE_REVOKED,
and the state manager on the client will be looping unable to satisfy
the server.
List discussion also reports a race between OPEN and DELEGRETURN that
will be avoided by only sending the delegation once to the
client. This is also logically in accordance with RFC5561 9.1.1 and 10.2.
So, let's:
1.) Not hand out duplicate delegations.
2.) Only send them to the client once.
RFC 5561:
9.1.1:
"Delegations and layouts, on the other hand, are not associated with a
specific owner but are associated with the client as a whole
(identified by a client ID)."
10.2:
"...the stateid for a delegation is associated with a client ID and may be
used on behalf of all the open-owners for the given client. A
delegation is made to the client as a whole and not to any specific
process or thread of control within it."
Reported-by: Eric Meddaugh <etmsys@rit.edu> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Olga Kornievskaia <aglo@umich.edu> Signed-off-by: Andrew Elble <aweits@rit.edu> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3c10560e50f845881cdecfbc22fea7939720ad38) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were
running in parallel. The server would receive the OPEN_DOWNGRADE first
and check its seqid, but then an OPEN would race in and bump it. The
OPEN_DOWNGRADE would then complete and bump the seqid again. The result
was that the OPEN_DOWNGRADE would be applied after the OPEN, even though
it should have been rejected since the seqid changed.
The only recourse we have here I think is to serialize operations that
bump the seqid in a stateid, particularly when we're given a seqid in
the call. To address this, we add a new rw_semaphore to the
nfs4_ol_stateid struct. We do a down_write prior to checking the seqid
after looking up the stateid to ensure that nothing else is going to
bump it while we're operating on it.
In the case of OPEN, we do a down_read, as the call doesn't contain a
seqid. Those can run in parallel -- we just need to serialize them when
there is a concurrent OPEN_DOWNGRADE or CLOSE.
LOCK and LOCKU however always take the write lock as there is no
opportunity for parallelizing those.
Reported-and-Tested-by: Andrew W Elble <aweits@rit.edu> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 35b2e295e669a6f6817f51a6f14d821e3f5f7ce0) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If a EXT4 filesystem utilizes JBD2 journaling and an error occurs, the
journaling will be aborted first and the error number will be recorded
into JBD2 superblock and, finally, the system will enter into the
panic state in "errors=panic" option. But, in the rare case, this
sequence is little twisted like the below figure and it will happen
that the system enters into panic state, which means the system reset
in mobile environment, before completion of recording an error in the
journal superblock. In this case, e2fsck cannot recognize that the
filesystem failure occurred in the previous run and the corruption
wouldn't be fixed.
There is a use-after-free possibility in __ext4_journal_stop() in the
case that we free the handle in the first jbd2_journal_stop() because
we're referencing handle->h_err afterwards. This was introduced in 9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by
storing the handle->h_err value beforehand and avoid referencing
potentially freed handle.
Fixes: 9705acd63b125dee8b15c705216d7186daea4625 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 456dd91e06dd5194d263c0693811852308c4cc09) Signed-off-by: Dan Duval <dan.duval@oracle.com>
There are times when ext4_bio_write_page() is called even though we
don't actually need to do any I/O. This happens when ext4_writepage()
gets called by the jbd2 commit path when an inode needs to force its
pages written out in order to provide data=ordered guarantees --- and
a page is backed by an unwritten (e.g., uninitialized) block on disk,
or if delayed allocation means the page's backing store hasn't been
allocated yet. In that case, we need to skip the call to
ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
bounce page and an ext4 crypto context getting leaked.
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b8a7a30104317fd37389b2e2b75cc6f3fa7aef2a) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit 4e752f0ab0e8 ("rbd: access snapshot context and mapping size
safely") moved ceph_get_snap_context() out of rbd_img_request_create()
and into rbd_queue_workfn(), adding a ceph_put_snap_context() to the
error path in rbd_queue_workfn(). However, rbd_img_request_create()
consumes a ref on snapc, so calling ceph_put_snap_context() after
a successful rbd_img_request_create() leads to an extra put. Fix it.
If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.
This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).
The following sequence diagram shows how the race happens when running a
no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:
(Note the implicit hole for inode 257 regarding the [0, 8K[ range)
CPU 1 CPU 2
run_dealloc_nocow()
btrfs_lookup_file_extent()
--> searches for a key with value
(257 EXTENT_DATA 4096) in the
fs/subvol tree
--> returns us a path with
path->nodes[0] == leaf X and
path->slots[0] == N
because path->slots[0] is >=
btrfs_header_nritems(leaf X), it
calls btrfs_next_leaf()
btrfs_next_leaf()
--> releases the path
hard link added to our inode,
with key (257 INODE_REF 500)
added to the end of leaf X,
so leaf X now has N + 1 keys
--> searches for the key
(257 INODE_REF 256), because
it was the last key in leaf X
before it released the path,
with path->keep_locks set to 1
--> ends up at leaf X again and
it verifies that the key
(257 INODE_REF 256) is no longer
the last key in the leaf, so it
returns with path->nodes[0] ==
leaf X and path->slots[0] == N,
pointing to the new item with
key (257 INODE_REF 500)
the loop iteration of run_dealloc_nocow()
does not break out the loop and continues
because the key referenced in the path
at path->nodes[0] and path->slots[0] is
for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
and its offset (500) is less then our delalloc
range's end (8192)
the item pointed by the path, an inode reference item,
is (incorrectly) interpreted as a file extent item and
we get an invalid extent type, leading to the BUG_ON(1):
This happened because the item we were processing did not match a file
extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even on this
case we cast the item to a struct btrfs_file_extent_item pointer and
then find a type field value that does not match any of the expected
values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
due to a tiny time window where a race can happen as exemplified below.
For example, consider the following scenario where we're using the
NO_HOLES feature and we have the following two neighbour leafs:
Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
than explicit because NO_HOLES is enabled). Now if our inode has an
ordered extent for the range [4K, 8K[ that is finishing, the following
can happen:
CPU 1 CPU 2
btrfs_finish_ordered_io()
insert_reserved_file_extent()
__btrfs_drop_extents()
Searches for the key
(257 EXTENT_DATA 4096) through
btrfs_lookup_file_extent()
Key not found and we get a path where
path->nodes[0] == leaf X and
path->slots[0] == N
Because path->slots[0] is >=
btrfs_header_nritems(leaf X), we call
btrfs_next_leaf()
btrfs_next_leaf() releases the path
inserts key
(257 INODE_REF 4096)
at the end of leaf X,
leaf X now has N + 1 keys,
and the new key is at
slot N
btrfs_next_leaf() searches for
key (257 INODE_REF 256), with
path->keep_locks set to 1,
because it was the last key it
saw in leaf X
finds it in leaf X again and
notices it's no longer the last
key of the leaf, so it returns 0
with path->nodes[0] == leaf X and
path->slots[0] == N (which is now
< btrfs_header_nritems(leaf X)),
pointing to the new key
(257 INODE_REF 4096)
__btrfs_drop_extents() casts the
item at path->nodes[0], slot
path->slots[0], to a struct
btrfs_file_extent_item - it does
not skip keys for the target
inode with a type less than
BTRFS_EXTENT_DATA_KEY
(BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)
sees a bogus value for the type
field triggering the WARN_ON in
the trace shown above, and sets
extent_end = search_start (4096)
does the if-then-else logic to
fixup 0 length extent items created
by a past bug from hole punching:
if (extent_end == key.offset &&
extent_end >= search_start)
goto delete_extent_item;
that evaluates to true and it ends
up deleting the key pointed to by
path->slots[0], (257 INODE_REF 4096),
from leaf X
The same could happen for example for a xattr that ends up having a key
with an offset value that matches search_start (very unlikely but not
impossible).
So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
skipped, never casted to struct btrfs_file_extent_item and never deleted
by accident. Also protect against the unexpected case of getting a key
for a lower inode number by skipping that key and issuing a warning.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 51ac5edb612f7890bdce6b67d9c84697aaac96e3) Signed-off-by: Dan Duval <dan.duval@oracle.com>