This reverts commit 5ffd3412ae55
("jffs2: Fix lock acquisition order bug in jffs2_write_begin").
The commit modified jffs2_write_begin() to remove a deadlock with
jffs2_garbage_collect_live(), but this introduced new deadlocks found
by multiple users. lock_page() actually has to be called before
mutex_lock(&c->alloc_sem) or mutex_lock(&f->sem), because
jffs2_write_end() and jffs2_readpage() are called with the page locked,
and they acquire c->alloc_sem and f->sem, respectively.
In other words, the lock order in jffs2_write_begin() was correct, and
it is the jffs2_garbage_collect_live() path that has to be changed.
Revert the commit to get rid of the new deadlocks, and to clear the way
for a better fix of the original deadlock.
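The deadlock class at issue here is plain ABBA lock inversion. Below is a
minimal userspace sketch (generic pthread mutexes, not the actual jffs2
locks) of how two paths taking the same pair of locks in opposite order
can wedge each other:

    #include <pthread.h>
    #include <stdio.h>

    /* Generic ABBA illustration: path_one locks A then B (like taking
     * the page lock before f->sem), path_two locks B then A. Run
     * concurrently, each can end up holding one lock and waiting
     * forever for the other. */
    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    static void *path_one(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock_a);
        pthread_mutex_lock(&lock_b);
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
    }

    static void *path_two(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock_b);
        pthread_mutex_lock(&lock_a);
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, path_one, NULL);
        pthread_create(&t2, NULL, path_two, NULL);
        pthread_join(t1, NULL);    /* may hang: that is the point */
        pthread_join(t2, NULL);
        puts("no deadlock this run");
        return 0;
    }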
Reported-by: Deng Chao <deng.chao1@zte.com.cn>
Reported-by: Ming Liu <liu.ming50@gmail.com>
Reported-by: wangzaiwei <wangzaiwei@top-vision.cn>
Signed-off-by: Thomas Betker <thomas.betker@rohde-schwarz.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 74d11976ff45dfe15b8a965d72237ac98533f788)
The contract between try_read() and try_write() is that when called
each processes as much data as possible. When instructed by osd_client
to skip a message, try_read() is violating this contract by returning
after receiving and discarding a single message instead of checking for
more. try_write() then gets a chance to write out more requests,
generating more replies/skips for try_read() to handle, forcing the
messenger into a starvation loop.
Cc: stable@vger.kernel.org # 3.10+
Reported-by: Varada Kari <Varada.Kari@sandisk.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Varada Kari <Varada.Kari@sandisk.com>
Reviewed-by: Alex Elder <elder@linaro.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 265570c9977908479db74fd07b710ec5d5c96e12)
The qword_get() function NUL-terminates its output buffer. If the input
string is in hex format \xXXXX... and the same length as the output
buffer, there is an off-by-one: the terminating NUL is written one byte
past the end of the buffer.
This patch ensures the NUL terminator doesn't fall outside the output
buffer.
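A userspace sketch of the bug class (an illustrative parser, not the
kernel's qword_get()); the buggy loop condition let the decoded data
exactly fill the buffer, pushing the terminating NUL one byte out of
bounds:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Decode "414243"-style hex input into buf; illustrative only.
     * The buggy version looped 'while (len < bufsize)', so when the
     * data exactly filled the buffer, the final NUL landed one byte
     * past the end. Stopping at bufsize - 1 keeps it in bounds. */
    static int hex_get(const char *str, char *buf, int bufsize)
    {
        int len = 0;

        while (len < bufsize - 1 &&        /* buggy: len < bufsize */
               isxdigit((unsigned char)str[0]) &&
               isxdigit((unsigned char)str[1])) {
            char tmp[3] = { str[0], str[1], '\0' };
            buf[len++] = (char)strtol(tmp, NULL, 16);
            str += 2;
        }
        buf[len] = '\0';           /* in bounds only if len < bufsize */
        return len;
    }

    int main(void)
    {
        char buf[4];

        printf("%d %s\n", hex_get("41424344", buf, sizeof(buf)), buf);
        return 0;                  /* prints "3 ABC": truncated safely */
    }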
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 4ba9f8051f7fc263cc5915fd6c2ac6d9195418b4)
Using request-based DM mpath configured with the following stacking
(.request_fn DM mpath on top of scsi-mq paths):
echo Y > /sys/module/scsi_mod/parameters/use_blk_mq
echo N > /sys/module/dm_mod/parameters/use_blk_mq
'struct dm_rq_target_io' would leak if a request is requeued before a
blk-mq clone is allocated (or fails to allocate). free_rq_tio()
wasn't being called.
In the case where d_add_unique() finds an appropriate alias to use, it will
have already incremented the reference count. An additional dget() to swap
the open context's dentry is unnecessary and will leak a reference.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: 275bb307865a3 ("NFSv4: Move dentry instantiation into the NFSv4-...")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 038d8248fc62423b16810878f14d59070a0076cb)
As reported by Soohoon Lee, the HDIO_GET_32BIT ioctl does not
work correctly in compat mode with libata.
I have investigated the issue further and found multiple problems
that all appeared with the same commit that originally introduced
HDIO_GET_32BIT handling in libata back in linux-2.6.8 and presumably
also linux-2.4, as the code uses "copy_to_user(arg, &val, 1)" to copy
a 'long' variable containing either 0 or 1 to user space.
The problems with this are:
* On big-endian machines, this will always write a zero because it
stores the wrong byte into user space.
* In compat mode, the upper three bytes of the variable are updated
by the compat_hdio_ioctl() function, but they now contain
uninitialized stack data.
* The hdparm tool calling this ioctl uses a 'static long' variable
to store the result. This means at least the upper bytes are
initialized to zero, but calling another ioctl like HDIO_GET_MULTCOUNT
would fill them with data that remains stale when the low byte
is overwritten. Fortunately libata doesn't implement any of the
affected ioctl commands, so this would only happen when we query
both an IDE and an ATA device in the same command such as
"hdparm -N -c /dev/hda /dev/sda"
* The libata code for unknown reasons started using ATA_IOC_GET_IO32
and ATA_IOC_SET_IO32 as aliases for HDIO_GET_32BIT and HDIO_SET_32BIT,
while the ioctl commands that were added later use the normal
HDIO_* names. This is harmless but rather confusing.
This addresses all four issues by changing the code to use put_user()
on an 'unsigned long' variable in HDIO_GET_32BIT, like the IDE subsystem
does, and by clarifying the names of the ioctl commands.
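A userspace illustration of the endianness problem above: mirroring
"copy_to_user(arg, &val, 1)" with a one-byte copy out of a long shows why
big-endian machines always read back zero:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        long val = 1;
        unsigned char byte;

        /* Mirrors copy_to_user(arg, &val, 1): takes the lowest-addressed
         * byte of 'val', which is the MSB (always 0 here) on big-endian
         * and the LSB (1) on little-endian. */
        memcpy(&byte, &val, 1);
        printf("first byte of (long)1: %u\n", byte);
        return 0;
    }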
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Soohoon Lee <Soohoon.Lee@f5.com>
Tested-by: Soohoon Lee <Soohoon.Lee@f5.com>
Cc: stable@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 891985fa8e08d0c625ee71096fc3d10350eb051b)
Building perf out of the kernel tree is currently broken because the
MANIFEST file refers to kernel files that have been removed. With this
patch, 'make perf-targz-src-pkg' succeeds, as does building perf using the
generated tarfile.
Dmitry Vyukov noted recently that the sctp_port_hashtable had an error in
its size computation, observing that the current method never guaranteed
that the hashsize (measured in number of entries) would be a power of two,
which the input hash function for that table requires. The root cause of
the problem is that two values need to be computed (one, the allocation
order of the storage required, as passed to __get_free_pages, and two, the
number of entries for the hash table). Both need to be powers of two, but
for different reasons, and the existing code simply computes one order
value and uses it as the basis for both, which is wrong (i.e. it assumes
that ((1<<order)*PAGE_SIZE)/sizeof(bucket) is still a power of two when
it's not).
To fix this, we change the logic slightly. We start by computing a goal
allocation order (which is limited by the maximum size hash table we want
to support). Then we attempt to allocate that size table, decreasing the
order until a successful allocation is made. Then, with the resultant
successful order we compute the number of buckets that hash table supports,
which we then round down to the nearest power of two, giving us the number
of entries the table actually supports.
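A standalone sketch of that computation (page and bucket sizes are made-up
constants, not the sctp structures): derive the entry count a successful
allocation order yields, then round it down to a power of two so the hash
function's assumptions hold:

    #include <stdio.h>

    int main(void)
    {
        unsigned long page_size = 4096, bucket_size = 48; /* assumed */
        int order = 3;            /* pretend this order allocated fine */
        unsigned long entries = ((1UL << order) * page_size) / bucket_size;

        /* round down to the nearest power of two by clearing low bits */
        while (entries & (entries - 1))
            entries &= entries - 1;

        printf("usable hash entries: %lu\n", entries); /* 682 -> 512 */
        return 0;
    }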
I've tested this locally here, using non-debug and spinlock-debug kernels,
and the number of entries in the hash table consistently works out to be
a power of two in all cases.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
CC: Dmitry Vyukov <dvyukov@google.com>
CC: Vladislav Yasevich <vyasevich@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 00731a62f0f167760ff7a536a30b9eb441794aaf)
An error response from a RTM_GETNETCONF request can return the positive
error value EINVAL in the struct nlmsgerr, which can mislead userspace.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 3aa450dcb11d582aab3a6aaf19b67380ab4322bc)
Since the GC of ipv4 routes was removed, a cached route has no chance of
being removed; even after it has timed out, it can still be used, because
no code checks its expiry.
Fix this issue by checking the route cache and removing expired entries
when we get a route.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 1610a64d0c495cd84d848e5ff4e92f1f9f0ec6e5)
Drop reference on the relay_po socket when __pppoe_xmit() succeeds.
This is already handled correctly in the error path.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit dda12d1bb3534b095c3d3e537e081339e0b7192f)
A previous commit (33f72e6) added notification via netlink for tunnels
when created/modified/deleted. If the notification returned an error,
this error was returned from the tunnel function. If there were no
listeners, the error code ESRCH was returned, even though having no
listeners is not an error. Other calls to this and other similar
notification functions either ignore the error code, or filter ESRCH.
This patch checks for ESRCH and does not flag this as an error.
Reviewed-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 89afe37040ed95a8959d67efc784a27948f0a76f)
The unix_dgram_sendmsg routine uses the following test:
if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
to determine if sk and other are in an n:1 association (either
established via connect or by using sendto to send messages to an
unrelated socket identified by address). This isn't correct as the
specified address could have been bound to the sending socket itself or
because this socket could have been connected to itself by the time of
the unix_peer_get but disconnected before the unix_state_lock(other). In
both cases, the if-block would be entered despite other == sk which
might either block the sender unintentionally or lead to trying to unlock
the same spin lock twice for a non-blocking send. Add an other != sk
check to guard against this.
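A self-contained sketch of the corrected test, with the kernel socket
types reduced to plain pointers (illustrative, not the literal diff):

    #include <stdbool.h>
    #include <stdio.h>

    /* The n:1 throttling path must not be entered when the destination
     * socket is the sender itself, whatever the peer/queue state says. */
    static bool take_n1_path(const void *sk, const void *other,
                             const void *other_peer, bool other_recvq_full)
    {
        return other != sk &&                  /* the added guard */
               other_peer != sk && other_recvq_full;
    }

    int main(void)
    {
        int s; /* stands in for the sending socket */

        /* datagram addressed to the sender itself: guard skips the path */
        printf("%d\n", take_n1_path(&s, &s, NULL, true)); /* prints 0 */
        return 0;
    }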
Fixes: 7d267278a9ec ("unix: avoid use-after-free in ep_remove_wait_queue")
Reported-By: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Tested-by: Philipp Hahn <pmhahn@pmhahn.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 73fd505d34328b113fb9743688f245991409a830)
Dmitry reported memory leaks of IP options allocated in
ip_cmsg_send() when/if this function returns an error.
Callers are responsible for the freeing.
Many thanks to Dmitry for the report and diagnostic.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit ffe8615cbb6abdbb66b8c8b33a3e7ea06c6662f9)
The current logic in bond_arp_rcv will accept an incoming ARP for
validation if (a) the receiving slave is either "active" (which includes
the currently active slave, or the current ARP slave) or, (b) there is a
currently active slave, and it has received an ARP since it became active.
For case (b), the receiving slave isn't the currently active slave, and is
receiving the original broadcast ARP request, not an ARP reply from the
target.
This logic can fail if there is no currently active slave. In
this situation, the ARP probe logic cycles through all slaves, assigning
each in turn as the "current_arp_slave" for one arp_interval, then setting
that one as "active," and sending an ARP probe from that slave. The
current logic expects the ARP reply to arrive on the sending
current_arp_slave, however, due to switch FDB updating delays, the reply
may be directed to another slave.
This can arise if the bonding slaves and switch are working, but
the ARP target is not responding. When the ARP target recovers, a
condition may result wherein the ARP target host replies faster than the
switch can update its forwarding table, causing each ARP reply to be sent
to the previous current_arp_slave. This will never pass the logic in
bond_arp_rcv, as neither of the above conditions (a) or (b) are met.
Some experimentation on a LAN shows ARP reply round trips in the
200 usec range, but my available switches never update their FDB in less
than 4000 usec.
This patch changes the logic in bond_arp_rcv to additionally
accept an ARP reply for validation on any slave if there is a current ARP
slave and it sent an ARP probe during the previous arp_interval.
Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave really works")
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 54d0901de6ecdfc72514f5c610fe89d65674c126)
This patch corrects the unaligned accesses seen on GRE TEB tunnels when
generating hash keys. Specifically, what this patch does is make it so that
we force the use of skb_copy_bits when the GRE inner headers will be
unaligned due to NET_IP_ALIGN being a non-zero value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit f68887de614bcf857651c0d2e24cfc3004dc20e4)
flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen
Orabug: stable_rc4
[ Upstream commit 461547f3158978c180d74484d58e82be9b8e7357, since
we lack the flow dissector flags in this release we guard the
flow label access using a test on 'skb' being NULL ]
This patch fixes an issue with unaligned accesses when using
eth_get_headlen on a page that was DMA aligned instead of being IP aligned.
The fact is, when trying to check the length, we don't need to be looking
at the flow label, so we can reorder the checks to first check if we are
supposed to gather the flow label and then make the call to actually get
it.
v2: Updated path so that either STOP_AT_FLOW_LABEL or KEY_FLOW_LABEL can
cause us to check for the flow label.
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 916f99656dc7d69355cb1045530c421cb1976590)
Commit ed5a377d87dc ("sctp: translate host order to network order when
setting a hmacid") corrected the hmacid byte order when setting a hmacid,
but the same issue also exists when getting a hmacid.
We fix it by changing hmacids to host order when users get them with
getsockopt.
Fixes: ed5a377d87dc ("sctp: translate host order to network order when setting a hmacid")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit bb6f76ba4c6b29e4650310982dc61639d34fa833)
tg3_tso_bug() can hit a condition where the entire tx ring is not big
enough to segment the GSO packet. For example, if MSS is very small,
gso_segs can exceed the tx ring size. When we hit the condition, it
will cause tx timeout.
tg3_tso_bug() is called to handle TSO and DMA hardware bugs.
For TSO bugs, if tg3_tso_bug() cannot succeed, we have to drop the packet.
For DMA bugs, we can still fall back to linearize the SKB and let the
hardware transmit the TSO packet.
This patch adds a function tg3_tso_bug_gso_check() to check if there
are enough tx descriptors for GSO before calling tg3_tso_bug().
The caller will then handle the error appropriately - drop or
linearize the SKB.
v2: Corrected patch description to avoid confusion.
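A standalone sketch of the shape of such a check (the ring-headroom
factor and names are assumptions, not the driver's exact code):

    #include <stdbool.h>
    #include <stdio.h>

    /* Refuse the tg3_tso_bug() fallback when the GSO packet could never
     * fit in the tx ring; the caller then drops instead of timing out. */
    static bool tso_bug_gso_check(unsigned int gso_segs,
                                  unsigned int tx_ring_pending)
    {
        return gso_segs < tx_ring_pending / 3; /* assumed headroom */
    }

    int main(void)
    {
        /* tiny MSS => many segments => fail the check => drop */
        printf("%d\n", tso_bug_gso_check(400, 512)); /* 0: won't fit */
        printf("%d\n", tso_bug_gso_check(10, 512));  /* 1: safe */
        return 0;
    }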
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
Acked-by: Prashant Sreedharan <prashant@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit beb9881442593e9de581bf7d5396788cfedcdbd3)
An RCU stall with the following backtrace was seen on a system with
forwarding, optimistic_dad and use_optimistic set. To reproduce,
set these flags and allow ipv6 autoconf.
This occurs because the device write_lock is acquired while already
holding the read_lock. Back trace below -
v2: do addrconf_dad_kick inside read lock and then acquire write
lock for ipv6_ifa_notify as suggested by Eric
Fixes: 7fd2561e4ebdd ("net: ipv6: Add a sysctl to make optimistic addresses useful candidates")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Erik Kline <ek@google.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 9f88ecf6c707ef3a6e26a8bd58e948728d2093d6)
The commit referenced in the Fixes tag incorrectly accounted the number
of in-flight fds over a unix domain socket to the original opener
of the file descriptor. This allows another process to arbitrarily
deplete the original file-opener's resource limit for the maximum number
of open files. Instead, the sending process and its struct cred should
be credited.
To do so, we add a reference counted struct user_struct pointer to the
scm_fp_list and use it to account for the number of inflight unix fds.
Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets")
Reported-by: David Herrmann <dh.herrmann@gmail.com>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 797c009c98dbb21127a2549d1106ed19d18661cf)
Silence a lockdep false positive about rcu_dereference() being
used in the wrong context.
The first one should use rcu_dereference_protected(), as we own the
spinlock. The second one should be a normal assignment, as no barrier
is needed.
Fixes: 18367681a10bd ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 6c92a8f0502d3e9a9036473408f7f67b58f22ea9)
Currently, the egress interface index specified via IPV6_PKTINFO
is ignored by __ip6_datagram_connect(), so that RFC 3542 section 6.7
can be subverted when the user space application calls connect()
before sendmsg().
Fix it by properly initializing flowi6_oif in connect() before
performing the route lookup.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit d04aa950dbd249222f87c96fac9c547890524f5a)
The current implementation of ip6_dst_lookup_tail basically
ignores the egress ifindex match: if the saddr is set,
ip6_route_output() purposefully ignores flowi6_oif, due
to the commit d46a9d678e4c ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
flag if saddr set"), if the saddr is 'any' the first route lookup
in ip6_dst_lookup_tail fails, but upon failure a second lookup will
be performed with saddr set, thus ignoring the ifindex constraint.
This commit adds an output route lookup function variant, which
allows the caller to specify lookup flags, and modify
ip6_dst_lookup_tail() to enforce the ifindex match on the second
lookup via said helper.
ip6_route_output() now becomes a static inline function built on
top of ip6_route_output_flags(); as a side effect, out-of-tree
modules now need a GPL license to access the output route lookup
functionality.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 0c6943f31317d02178f55b882e36e03f4e5795f7)
With some combinations of user provided flags in netlink command,
it is possible to call tcp_get_info() with a buffer that is not 8-bytes
aligned.
It does matter on some arches, so we need to use put_unaligned() to
store the u64 fields.
Current iproute2 package does not trigger this particular issue.
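A userspace sketch of what put_unaligned() guards against: storing a u64
through a pointer that is only 4-byte aligned traps on some architectures,
while a byte-wise copy is safe everywhere:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned char buf[16];
        uint64_t bytes_acked = 123456789ULL;

        /* buf + 4 is 4-byte aligned at best; a direct 64-bit store
         * through a cast pointer would be an unaligned access. memcpy
         * (like the kernel's put_unaligned) makes no alignment
         * assumptions. */
        memcpy(buf + 4, &bytes_acked, sizeof(bytes_acked));

        uint64_t check;
        memcpy(&check, buf + 4, sizeof(check));
        printf("%llu\n", (unsigned long long)check);
        return 0;
    }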
Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
Fixes: 977cb0ecf82e ("tcp: add pacing_rate information into tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 1987c92ad0381beab46a65f79827d650de533584)
When switchdev drivers process FDB notifications from the underlying
device they resolve the netdev to which the entry points and notify
the bridge using the switchdev notifier.
However, since the RTNL mutex is not held there is nothing preventing
the netdev from disappearing in the middle, which will cause
br_switchdev_event() to dereference a non-existing netdev.
Make switchdev drivers hold the lock at the beginning of the
notification processing session and release it once it ends, after
notifying the bridge.
Also, remove switchdev_mutex and fdb_lock, as they are no longer needed
when RTNL mutex is held.
Fixes: 03bf0c281234 ("switchdev: introduce switchdev notifier")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 4c62ddf7e39943b356157b4c2196a71825c8328c)
In 'commit 7fe8097cef5f ("tipc: fix nullpointer bug when subscribing
to events")', we terminate the connection if the subscription
creation fails.
In the same commit, the subscription creation result was based on
the value of the subscription pointer (set in the function) instead
of the return code.
Unfortunately, the same function tipc_subscrp_create() handles
subscription cancel request. For a subscription cancellation request,
the subscription pointer cannot be set. Thus if a subscriber has
several subscriptions and cancels any of them, the connection is
terminated.
In this commit, we terminate the connection based on the return value
of tipc_subscrp_create().
Fixes: 7fe8097cef5f ("tipc: fix nullpointer bug when subscribing to events")
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 257a04782588b4b97ac005a044f85947f792be44)
This patch extends commit b93d6471748d ("sctp: implement the sender side
for SACK-IMMEDIATELY extension") as it didn't white list
SCTP_SACK_IMMEDIATELY on sctp_msghdr_parse(), causing it to be
understood as an invalid flag and returning -EINVAL to the application.
Note that the actual handling of the flag is already there in
sctp_datamsg_from_user().
https://tools.ietf.org/html/rfc7053#section-7
Fixes: b93d6471748d ("sctp: implement the sender side for SACK-IMMEDIATELY extension")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 6e13e7b0183e0b9945ff350bd5fad573c2ece83c)
Several times already this has been reported as kasan reports caused by
syzkaller and trinity and people always looked at RCU races, but it is
much more simple. :)
In case we bind a pptp socket multiple times, we simply add it to
the callid_sock list but don't remove the old binding. Thus the old
socket stays in the bucket with unused call_id indexes and doesn't get
cleaned up. This causes various forms of kasan reports which were hard
to pinpoint.
Simply don't allow multiple binds and correct error handling in
pptp_bind. Also keep sk_state bits in place in pptp_connect.
Fixes: 00959ade36acad ("PPTP: PPP over IPv4 (Point-to-Point Tunneling Protocol)")
Cc: Dmitry Kozlov <xeb@mail.ru>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit ce28c3ced53aa6385eafeabaeb9c70eca9e3b1ba)
Currently, wq_pool_mutex doesn't protect the attrs-installation; as a
result, ->unbound_attrs, ->numa_pwq_tbl[] and ->dfl_pwq can only be
accessed under wq->mutex, which causes some inconveniences. For example,
wq_update_unbound_numa() has to acquire wq->mutex before fetching the
wq->unbound_attrs->no_numa and the old_pwq.
The attrs-installation is a short operation, so this change will not cause
any latency for other operations which also acquire the wq_pool_mutex.
The only unprotected attrs-installation code is in apply_workqueue_attrs(),
so this patch touches less code than comments.
It is also a preparation patch for the next several patches, which read
wq->unbound_attrs, wq->numa_pwq_tbl[] and wq->dfl_pwq with
only wq_pool_mutex held.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 5b95e1af8d17d85a17728f6de7dbff538e6e3c49)
Current apply_workqueue_attrs() includes pwqs-allocation and pwqs-installation,
so when we batch multiple apply_workqueue_attrs()s as a transaction, we
can't ensure that the transaction succeeds or fails as a complete unit.
To solve this, we split apply_workqueue_attrs() into three stages.
The first stage does the preparation: allocating memory and pwqs.
The second stage does the attrs-installation and pwqs-installation.
The third stage frees the allocated memory and (old or unused) pwqs.
As the result, batching multiple apply_workqueue_attrs()s can
succeed or fail as a complete unit:
1) batch do all the first stage for all the workqueues
2) only commit all when all the above succeed.
This patch is a preparation for the next patch ("Allow modifying low level
unbound workqueue cpumask") which will do a multiple apply_workqueue_attrs().
The patch doesn't change functionality except for two minor adjustments:
1) free_unbound_pwq() for the error path is removed; we use the
heavier version put_pwq_unlocked() instead, since the error path
is rare. This adjustment simplifies the code.
2) the memory allocation is also moved inside wq_pool_mutex.
This is needed to avoid further splitting.
tj: minor updates to comments.
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Mike Galbraith <bitbucket@online.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
(cherry picked from commit 2d5f0764b5264d2954ba6e3deb04f4f5de8e4476)
The problem here is the skb we provide to tcp_v4_send_ack() had to
be parked in the backlog of a new TCP fastopen child because this child
was owned by the user at the time an out of window packet arrived.
Before queuing a packet, TCP has to set skb->dev to NULL as the device
could disappear before packet is removed from the queue.
Fix this issue by using the net pointer provided by the socket (being a
timewait or a request socket).
IPv6 is immune to the bug: tcp_v6_send_response() already gets the net
pointer from the socket if provided.
Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jerry Chu <hkchu@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 93493f5aaff542f0d542c48a3fda2dea1afe5e9e)
Competing overwrite DIO in dioread_nolock mode will just overwrite
pointer to io_end in the inode. This may result in data corruption or
extent conversion happening from IO completion interrupt because we
don't properly set buffer_defer_completion() when unlocked DIO races
with locked DIO to unwritten extent.
Since unlocked DIO doesn't need io_end for anything, just avoid
allocating it and corrupting pointer from inode for locked DIO.
A cleaner fix would be to avoid these games with io_end pointer from the
inode but that requires more intrusive changes so we leave that for
later.
Cc: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 62adae8f26b918f4403129e355d81301a629b6a2)
remap_file_pages(2) emulation can reach file which represents removed
IPC ID as long as a memory segment is mapped. It breaks expectations of
IPC subsystem.
Test case (rewritten to be more human readable, originally autogenerated
by syzkaller[1]):
The reason is that remap_file_pages() emulation assumes that the target
vma covers the whole area we want to over-map. That assumption is broken
by the first remap_file_pages() call: it splits the area into two vmas.
The solution is to check the next adjacent vmas and see if they map the
same file with the same flags.
Considering Linus' past rants about the (ab)use of BUG in the kernel, I
took a look at how we deal with such calls in ipc. Given that any errors
or corruption in ipc code are most likely contained within the set of
processes participating in the broken mechanisms, there aren't really many
strong fatal system failure scenarios that would require a BUG call.
Also, if something is seriously wrong, ipc might not be the place for such
a BUG either.
1. For example, recently, a customer hit one of these BUG_ONs in shm
after failing shm_lock(). A busted ID imho does not merit a BUG_ON,
and WARN would have been better.
2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore.
I don't see how we can hit this anyway -- at least it should be IS_ERR.
The 'copy' arg from do_msgrcv is always set by calling prepare_copy()
first and foremost. We could also probably drop this check altogether.
Either way, it does not merit a BUG_ON.
3. No ->fault() callback for the fs getting the corresponding page --
seems selfish to make the system unusable.
A non-atomic PCM stream may take snd_pcm_link_rwsem rw semaphore twice
in the same code path, e.g. one in snd_pcm_action_nonatomic() and
another in snd_pcm_stream_lock(). Usually this is OK, but when a
write lock is issued between these two read locks, the problem
happens: the write lock is blocked due to the first read lock, and
the second read lock is also blocked by the write lock. This
eventually deadlocks.
The reason is the way rwsem manages waiters; it's queued like FIFO, so
even if the writer itself doesn't take the lock yet, it blocks all the
waiters (including reads) queued after it.
As a workaround, in this patch, we replace the standard down_write()
with a spinning loop. This is far from optimal, but it's good
enough, as the spinning time is supposed to be relatively short for
normal PCM operations, and the code paths requiring the write lock
aren't called so often.
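A kernel-style sketch of such a spinning write lock (simplified from the
description; treat it as a sketch, not necessarily the exact code):

    #include <linux/rwsem.h>
    #include <linux/sched.h>

    /* Spin until the write lock is free instead of queueing behind it,
     * so a waiting writer never blocks a path that must take the read
     * side twice. Suboptimal, but the write paths here are rare. */
    static inline void down_write_nonblock(struct rw_semaphore *lock)
    {
            while (!down_write_trylock(lock))
                    cond_resched();
    }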
A kernel page fault oops with the callstack below was observed
when a read syscall was made to a pmem device after a huge amount
(>512GB) of vmalloc ranges was allocated by ioremap() on a x86_64
system:
Since v4.1, ioremap() supports large page (pud/pmd) mappings in
x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
range is limited to pte mappings.
vmalloc faults do not normally happen in ioremap'd ranges since
ioremap() sets up the kernel page tables, which are shared by
user processes. pgd_ctor() sets the kernel's PGD entries to
user's during fork(). When allocation of the vmalloc ranges
crosses a 512GB boundary, ioremap() allocates a new pud table
and updates the kernel PGD entry to point to it. If the user process's
PGD entry does not have this update yet, a read/write syscall
to the range will cause a vmalloc fault, which hits the Oops
above as it does not handle a large page properly.
The following changes are made to vmalloc_fault():
64-bit:
- No change for the PGD sync operation as it handles large
pages already.
- Add pud_huge() and pmd_huge() to the validation code to
handle large pages.
- Change pud_page_vaddr() to pud_pfn() since an ioremap range
is not directly mapped (while the if-statement still works
with a bogus addr).
- Change pmd_page() to pmd_pfn() since an ioremap range is not
backed by struct page (while the if-statement still works
with a bogus addr).
32-bit:
- No change for the sync operation since the index3 PGD entry
covers the entire vmalloc range, which is always valid.
(A separate change to sync PGD entry is necessary if this
memory layout is changed regardless of the page size.)
- Add pmd_huge() to the validation code to handle large pages.
This is for completeness since vmalloc_fault() won't happen
in ioremap'd ranges as its PGD entry is always valid.
Reported-by: Henning Schild <henning.schild@siemens.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: <stable@vger.kernel.org> # 4.1+
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: linux-mm@kvack.org
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/1455758214-24623-1-git-send-email-toshi.kani@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 468eeda0837d3e4f1574f800dc460d85e6b49fb7)
The commit [7f0973e973cd: ALSA: seq: Fix lockdep warnings due to
double mutex locks] split the management of two linked lists (source
and destination) into two individual calls for avoiding the AB/BA
deadlock. However, this leaves open the possibility of a double deletion
of one of the two lists when the counterpart is being deleted concurrently.
It ends up with a list corruption, as revealed by syzkaller fuzzer.
This patch fixes it by checking the list emptiness and skipping the
deletion and the following process.
The tracepoint infrastructure uses RCU sched protection to enable and
disable tracepoints safely. There are some instances where tracepoints are
used in infrastructure code (like kfree()) that get called after a CPU is
going offline, and perhaps when it is coming back online but hasn't been
registered yet.
This can produce the following warning:
[ INFO: suspicious RCU usage. ]
4.4.0-00006-g0fe53e8-dirty #34 Tainted: G S
-------------------------------
include/trace/events/kmem.h:141 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
RCU used illegally from offline CPU!
rcu_scheduler_active = 1, debug_locks = 1
no locks held by swapper/8/0.
This warning is not a false positive either. RCU is not protecting code that
is being executed while the CPU is offline.
Instead of playing "whack-a-mole(TM)" and adding conditional statements to
the tracepoints we find that are used in this instance, simply add a
cpu_online() test to the tracepoint code where the tracepoint will be
ignored if the CPU is offline.
Use of raw_smp_processor_id() is fine, as there should never be a case where
the tracepoint code goes from running on a CPU that is online and suddenly
gets migrated to a CPU that is offline.
When multiple concurrent writes happen on the ALSA sequencer device
right after the open, it may try to allocate vmalloc buffer for each
write and leak some of them. It's because the presence check and the
assignment of the buffer is done outside the spinlock for the pool.
The fix is to move the check and the assignment into the spinlock.
(The current implementation is suboptimal, as there can be multiple
unnecessary vmallocs because the allocation is done before the check
in the spinlock. But the pool size is already checked beforehand, so
this isn't a big problem; that is, the only possible path is the
multiple writes before any pool assignment, and practically seen, the
current coverage should be "good enough".)
Commit 35dc248383bbab0a7203fca4d722875bc81ef091 introduced a check for
current->mm to see if we have a user space context and only copies data
if we do. Now if an IO gets interrupted by a signal, data isn't copied
into user space any more (as we don't have a user space context), but
user space isn't notified about it.
This patch modifies the behaviour to return -EINTR from bio_uncopy_user()
to notify userland that a signal has interrupted the syscall, otherwise
it could lead to a situation where the caller may get a buffer with
no data returned.
This can be reproduced by issuing SG_IO ioctl()s in one thread while
constantly sending signals to it.
Fixes: 35dc248 ("[SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal")
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Cc: stable@vger.kernel.org # v.3.11+
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit f3dd341b9230ac943f317efa4c5b1bfae587434b)
This patch fixes vulnerability CVE-2016-2085. The problem exists
because the evm_verify_hmac() function includes a use of memcmp().
Unfortunately, this allows timing side channel attacks; specifically
a MAC forgery complexity drop from 2^128 to 2^12. This patch changes
the memcmp() to the cryptographically safe crypto_memneq().
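A userspace sketch of the difference (an illustrative constant-time
comparison; crypto_memneq() is the in-kernel helper): unlike memcmp(),
the loop never exits early, so timing reveals nothing about where the
MACs first differ:

    #include <stddef.h>
    #include <stdio.h>

    /* Accumulate differences over the full length; no data-dependent
     * early return, unlike memcmp(). Returns nonzero if a != b. */
    static int ct_memneq(const void *a, const void *b, size_t n)
    {
        const unsigned char *pa = a, *pb = b;
        unsigned char diff = 0;
        size_t i;

        for (i = 0; i < n; i++)
            diff |= pa[i] ^ pb[i];
        return diff != 0;
    }

    int main(void)
    {
        unsigned char m1[16] = { 0 }, m2[16] = { 0 };

        m2[15] = 1;
        printf("%d %d\n", ct_memneq(m1, m2, 16), ct_memneq(m1, m1, 16));
        return 0;
    }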
Reported-by: Xiaofei Rex Guo <xiaofei.rex.guo@intel.com>
Signed-off-by: Ryan Ware <ware@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit d185fa457006e98aa975ed6c0e7d2ddfe3d26695)
I notice ext4/307 fails occasionally on ppc64 host, reporting md5
checksum mismatch after moving data from original file to donor file.
The reason is that move_extent_per_page() calls __block_write_begin()
and block_commit_write() to write saved data from original inode blocks
to donor inode blocks, but __block_write_begin() not only maps buffer
heads but also reads block content from disk if the size is not block
size aligned. At this time the physical block number in mapped buffer
head is pointing to the donor file, not the original file, and that
results in reading wrong data into the page, which gets written to disk
in the following block_commit_write() call.
This also can be reproduced by the following script on 1k block size ext4
on x86_64 host:
# reserve space for donor file, written by 0xaa and sync to disk to
# avoid EBUSY on EXT4_IOC_MOVE_EXT
xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile
# create test file written by 0xbb
xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile
# compute initial md5sum
md5sum $testfile | tee md5sum.txt
# drop cache, force e4compact to read data from disk
echo 3 > /proc/sys/vm/drop_caches
Fix it by creating & mapping buffer heads only but not reading blocks
from disk, because all the data in page is guaranteed to be up-to-date
in mext_page_mkuptodate().
Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data),
an integer overflow can happen.
Therefore, the integer overflow sanitization needs to be fixed.
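A userspace sketch of this overflow class and the check that prevents it
(the sizes are made up, not the ext4 structures):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t count = UINT32_MAX / 8; /* attacker-controlled count */
        uint32_t elem = 16;              /* the larger struct's size */

        /* check against the larger element size before multiplying,
         * otherwise count * elem wraps and undersizes the allocation */
        if (count > UINT32_MAX / elem)
            puts("rejected: count * elem would overflow");
        else
            printf("bytes: %u\n", (unsigned)(count * elem));
        return 0;
    }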
The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentionally set to a
larger value, then it's LLONG_MAX.
There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
the increment.
We can get to that situation like that:
* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.
The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:
The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.
The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.
References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
Reported-by: Victor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.com>
Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 3a68deb6673cffe2d66f0f0ace615bc0cb4c1153)
When looking up the pool_workqueue to use for an unbound workqueue,
workqueue assumes that the target CPU is always bound to a valid NUMA
node. However, currently, when a CPU goes offline, the mapping is
destroyed and cpu_to_node() returns NUMA_NO_NODE.
This has always been broken but hasn't triggered often enough before 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu").
After the commit, workqueue forcefully assigns the local CPU for
delayed work items without an explicit target CPU to fix a different
issue. This widens the window where a CPU can go offline while a
delayed work item is pending, causing delayed work items to be dispatched
with their target CPU set to an already offlined CPU. The resulting
NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
NULL pool_workqueue and thus crash.
While 874bbfe600a6 has been reverted for a different reason, making the
bug less visible again, it can still happen. Fix it by mapping
NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
This is a temporary workaround. The long term solution is keeping CPU
-> NODE mapping stable across CPU off/online cycles which is being
worked on.
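A kernel-style sketch of that mapping, reconstructed from the description
(field names are assumptions):

    /* Fall back to the default pwq when the CPU's node mapping is gone,
     * instead of indexing numa_pwq_tbl[] with NUMA_NO_NODE and crashing. */
    static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
                                                      int node)
    {
            if (unlikely(node == NUMA_NO_NODE))
                    return wq->dfl_pwq;
            return rcu_dereference_raw(wq->numa_pwq_tbl[node]);
    }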
snd_timer_user_read() has a potential race among parallel reads, as
qhead and qused are updated outside the critical section due to
copy_to_user() calls. Move them into the critical section, and also
sanitize the relevant code a bit.
The hda_jack_tbl entries are managed by snd_array for allowing
multiple jacks. It's good per se, but the problem is that struct
hda_jack_callback keeps the hda_jack_tbl pointer. Since snd_array
doesn't preserve each pointer at resizing the array, we can't keep the
original pointer but have to deduce the pointer each time via
snd_array_entry() instead. Actually, this resulted in a dereference
of the wrong pointer on codecs that have many pins such as CS4208.
This patch replaces the pointer to the NID value as the search key.
As an unexpected good side effect, this even simplifies the code, as
only NID is needed in most cases.
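A userspace analogue of the hazard (plain realloc() standing in for
snd_array growth): cache a stable key and re-look the entry up; never
cache the element pointer across a resize:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int *arr = malloc(4 * sizeof(int));
        uintptr_t old;

        if (!arr)
            return 1;
        old = (uintptr_t)&arr[0];       /* like caching the entry pointer */

        arr = realloc(arr, 1 << 20);    /* the array grows... */
        if (!arr)
            return 1;

        if ((uintptr_t)&arr[0] != old)  /* ...and may move */
            puts("array moved: any saved element pointer now dangles");
        else
            puts("array stayed put this time; the pointer is still unsafe");

        free(arr);
        return 0;
    }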
A slave timer element also unlinks at snd_timer_stop() but it takes
only slave_active_lock. When a slave is assigned to a master,
however, this may become a race against the master's interrupt
handling, eventually resulting in a list corruption. The actual bug
could be seen with a syzkaller fuzzer test case in BugLink below.
As a fix, we need to take timeri->timer->lock when timer isn't NULL,
i.e. assigned to a master, while the assignment to a master itself is
protected by slave_active_lock.
In snd_timer_notify1(), the wrong timer instance was passed to the slave
callback function. This leads to accessing the wrong data when
an incompatible master is handled (e.g. the master is the sequencer
timer and the slave is a user timer), as spotted by syzkaller fuzzer.
The starting node for a klist iteration is often passed in from
somewhere way above the klist infrastructure, meaning there's no
guarantee the node is still on the list. We've seen this in SCSI where
we use bus_find_device() to iterate through a list of devices. In the
face of heavy hotplug activity, the last device returned by
bus_find_device() can be removed before the next call. This leads to
Dec 3 13:22:02 localhost kernel: WARNING: CPU: 2 PID: 28073 at include/linux/kref.h:47 klist_iter_init_node+0x3d/0x50()
Dec 3 13:22:02 localhost kernel: Modules linked in: scsi_debug x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32c_intel joydev iTCO_wdt dcdbas ipmi_devintf acpi_power_meter iTCO_vendor_support ipmi_si ipmi_msghandler pcspkr wmi acpi_cpufreq tpm_tis tpm shpchp lpc_ich mfd_core nfsd nfs_acl lockd grace sunrpc tg3 ptp pps_core
Dec 3 13:22:02 localhost kernel: CPU: 2 PID: 28073 Comm: cat Not tainted 4.4.0-rc1+ #2
Dec 3 13:22:02 localhost kernel: Hardware name: Dell Inc. PowerEdge R320/08VT7V, BIOS 2.0.22 11/19/2013
Dec 3 13:22:02 localhost kernel: ffffffff81a20e77 ffff880613acfd18 ffffffff81321eef 0000000000000000
Dec 3 13:22:02 localhost kernel: ffff880613acfd50 ffffffff8107ca52 ffff88061176b198 0000000000000000
Dec 3 13:22:02 localhost kernel: ffffffff814542b0 ffff880610cfb100 ffff88061176b198 ffff880613acfd60
Dec 3 13:22:02 localhost kernel: Call Trace:
Dec 3 13:22:02 localhost kernel: [<ffffffff81321eef>] dump_stack+0x44/0x55
Dec 3 13:22:02 localhost kernel: [<ffffffff8107ca52>] warn_slowpath_common+0x82/0xc0
Dec 3 13:22:02 localhost kernel: [<ffffffff814542b0>] ? proc_scsi_show+0x20/0x20
Dec 3 13:22:02 localhost kernel: [<ffffffff8107cb4a>] warn_slowpath_null+0x1a/0x20
Dec 3 13:22:02 localhost kernel: [<ffffffff8167225d>] klist_iter_init_node+0x3d/0x50
Dec 3 13:22:02 localhost kernel: [<ffffffff81421d41>] bus_find_device+0x51/0xb0
Dec 3 13:22:02 localhost kernel: [<ffffffff814545ad>] scsi_seq_next+0x2d/0x40
[...]
And an eventual crash. It can actually occur in any hotplug system
which has a device finder and a starting device.
We can fix this globally by making sure the starting node for
klist_iter_init_node() is actually a member of the list before using it
(and by starting from the beginning if it isn't).
Reported-by: Ewan D. Milne <emilne@redhat.com>
Tested-by: Ewan D. Milne <emilne@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 84ec02eecb3b257f14cb8b10ffcd73d5420c96c0)
Considering current pty code and multiple devpts instances, it's possible
to umount a devpts file system while a program still has /dev/tty opened
pointing to a previously closed pty pair in that instance. In the case all
ptmx and pts/N files are closed, umount can be done. If the program closes
/dev/tty after umount is done, devpts_kill_index will use now an invalid
super_block, which was already destroyed in the umount operation after
running ->kill_sb. This is another "use after free" type of issue, but now
related to the allocated super_block instance.
To avoid the problem (warning at ida_remove and potential crashes) for
this specific case, I added two functions in devpts which grabs additional
references to the super_block, which pty code now uses so it makes sure
the super block structure is still valid until pty shutdown is done.
I also moved the additional inode references to the same functions, which
also covered similar case with inode being freed before /dev/tty final
close/shutdown.
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Cc: stable@vger.kernel.org # 2.6.29+
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 38bdf7e3fb81c3d9cbbacab6f96e9fd28ff6c704)
This change fixes a bug for a corner case where we have the last
release from a pty master/slave coming from a previously opened /dev/tty
file. When this happens, the tty->driver_data can be stale, due to all
ptmx or pts/N files having already been closed before (and thus the inode
related to these files, which tty->driver_data points to, being already
freed/destroyed).
The fix here is to keep a reference on the opened master ptmx inode.
We maintain the inode referenced until the final pty_unix98_shutdown,
and only pass this inode to devpts_kill_index.
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Cc: <stable@vger.kernel.org> # 2.6.29+
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 614f8734d11ad22ee17a5faecf355b70756904ef)
The omap-serial driver emulates RS485 delays using software timers,
but neglects to clamp the input values from the unprivileged
ioctl(TIOCSRS485). Because the software implementation busy-waits,
malicious userspace could stall the cpu for ~49 days.
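A userspace sketch of the missing clamp (the cap is an invented value,
not the driver's): bound the user-supplied delay before it feeds a
busy-wait:

    #include <stdio.h>

    /* Bound a user-supplied RS485 delay before it drives a busy-wait.
     * Without this, a delay near UINT_MAX milliseconds spins ~49 days. */
    static unsigned int clamp_rs485_delay_ms(unsigned int ms)
    {
        const unsigned int max_ms = 100;    /* illustrative cap */
        return ms > max_ms ? max_ms : ms;
    }

    int main(void)
    {
        printf("%u\n", clamp_rs485_delay_ms(4294967295u)); /* -> 100 */
        return 0;
    }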
Some cipher implementations will crash if you try to use them
without calling setkey first. This patch adds a check so that
the accept(2) call will fail with -ENOKEY if setkey hasn't been
done on the socket yet.
Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
[backported to 4.1 by Milan Broz <gmazyland@gmail.com>]
We fail to take the crypto_alg_sem semaphore when traversing the
crypto_alg_list for CRYPTO_MSG_GETALG dumps. This allows a race with
crypto_unregister_alg() removing algorithms from the list while we're
still traversing it, thereby leading to a use-after-free as shown below:
To trigger the race run the following loops simultaneously for a while:
$ while : ; do modprobe aesni-intel; rmmod aesni-intel; done
$ while : ; do crconf show all > /dev/null; done
Fix the race by taking the crypto_alg_sem read lock, thereby preventing
crypto_unregister_alg() from modifying the algorithm list during the
dump.
This bug has been detected by the PaX memory sanitize feature.
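A userspace analogue of the fix (a pthread rwlock standing in for
crypto_alg_sem): hold the read side across the whole list walk so a
concurrent unregister, which takes the write side, cannot free entries
mid-traversal:

    #include <pthread.h>
    #include <stdio.h>

    struct alg { struct alg *next; const char *name; };

    static pthread_rwlock_t alg_sem = PTHREAD_RWLOCK_INITIALIZER;
    static struct alg aes = { NULL, "aes" };
    static struct alg *alg_list = &aes;

    static void dump_algs(void)
    {
        const struct alg *a;

        pthread_rwlock_rdlock(&alg_sem); /* entries can't be freed now */
        for (a = alg_list; a; a = a->next)
            printf("%s\n", a->name);
        pthread_rwlock_unlock(&alg_sem);
    }

    int main(void)
    {
        dump_algs(); /* an unregister path would take the write lock */
        return 0;
    }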
Cc: stable@vger.kernel.org
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: PaX Team <pageexec@freemail.hu>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 09b09ced1e4e3afb2767ee96edc9d1d57fc1c55b)
sound/pci/hda/patch_hdmi.c:460 hdmi_eld_ctl_get()
error: __memcpy() 'eld->eld_buffer' too small (256 vs 512)
I have a hard time figuring out if this can ever cause an information leak
(I don't think so), but nonetheless it does not hurt to increase the
robustness of the code.
Fixes: 68e03de98507 ('ALSA: hda - hdmi: Do not expose eld data when eld is invalid')
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Henningsson <david.henningsson@canonical.com>
Cc: <stable@vger.kernel.org> # v3.9+
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 6cf9d4c32e8e2716eb8e9cb578dbcb2b2cad04b3)
This happens because in the code path executed by the inode_paths ioctl we
end up nesting two calls to read lock a leaf's rwlock when after the first
call to read_lock() and before the second call to read_lock(), another
task (running the delayed items as part of a transaction commit) has
already called write_lock() against the leaf's rwlock. This situation is
illustrated by the following diagram:
read_lock(&eb->lock);
--> makes this task hang
forever (and task B too
of course)
So fix this by avoiding doing the nested read lock, which is easily
avoidable. This issue does not happen if task B calls write_lock() after
task A does the second call to read_lock(), however there does not seem
to exist anything in the documentation that mentions what is the expected
behaviour for recursive locking of rwlocks (leaving the idea that doing
so is not a good usage of rwlocks).
Also, as a side effect necessary for this fix, make sure we do not
needlessly read lock extent buffers when the input path has skip_locking
set (used when called from send).
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 79179a68458a7fddcf35adb16d1f7a7880aa41e7)
This patch fixes a NULL pointer se_cmd->cmd_kref < 0
refcount bug during TMR LUN_RESET with active TMRs,
triggered during se_cmd + se_tmr_req descriptor
shutdown + release via core_tmr_drain_tmr_list().
To address this bug, go ahead and obtain a local
kref_get_unless_zero(&se_cmd->cmd_kref) for active I/O
to set CMD_T_ABORTED, and transport_wait_for_tasks()
followed by the final target_put_sess_cmd() to drop
the local ->cmd_kref.
Also add two new checks within target_tmr_work() to
avoid CMD_T_ABORTED -> TFO->queue_tm_rsp() callbacks
ahead of invoking the backend -> fabric put in
transport_cmd_check_stop_to_fabric().
For good measure, also change core_tmr_release_req()
to use list_del_init() ahead of se_tmr_req memory
free.
Reviewed-by: Quinn Tran <quinn.tran@qlogic.com>
Cc: Himanshu Madhani <himanshu.madhani@qlogic.com>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Andy Grover <agrover@redhat.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 422d2ab7e8cc67ed57d79a297f6832c74c0a0d80)
The bulk of ATA host state machine is implemented by
ata_sff_hsm_move(). The function is called from either the interrupt
handler or, if polling, a work item. Unlike from the interrupt path,
the polling path calls the function without holding the host lock and
ata_sff_hsm_move() selectively grabs the lock.
This is completely broken. If an IRQ triggers while polling is in
progress, the two can easily race and end up accessing the hardware
and updating state machine state at the same time. This can put the
state machine in an illegal state and lead to a crash like the
following.
Fix it by ensuring that the polling path is holding the host lock
before entering ata_sff_hsm_move() so that all hardware accesses and
state updates are performed under the host lock.
"count" is controlled by the user and it can be negative. Let's prevent
that by making it unsigned. You have to have CAP_SYS_RAWIO to call this
function so the bug is not as serious as it could be.
Fixes: 5369c02d951a ('intel_scu_ipc: Utility driver for intel scu ipc')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit abd62631a46c8a27fc607ea5525d3034e8d69781)
A Root Port's AER structure (rpc) contains a queue of events. aer_irq()
enqueues AER status information and schedules aer_isr() to dequeue and
process it. When we remove a device, aer_remove() waits for the queue to
be empty, then frees the rpc struct.
But aer_isr() references the rpc struct after dequeueing and possibly
emptying the queue, which can cause a use-after-free error as in the
following scenario with two threads, aer_isr() on the left and a
concurrent aer_remove() on the right:
Thread A                           Thread B
--------                           --------
aer_irq():
  rpc->prod_idx++
                                   aer_remove():
                                     wait_event(rpc->prod_idx == rpc->cons_idx)
                                     # now blocked until queue becomes empty
aer_isr():                           # ...
  rpc->cons_idx++                    # unblocked because queue is now empty
  ...                              kfree(rpc)
  mutex_unlock(&rpc->rpc_mutex)
To prevent this problem, use flush_work() to wait until the last scheduled
instance of aer_isr() has completed before freeing the rpc struct in
aer_remove().
I reproduced this use-after-free by flashing a device FPGA and
re-enumerating the bus to find the new device. With SLUB debug, this
crashes with 0x6b bytes (POISON_FREE, the use-after-free magic number) in
GPR25.
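The essence of the fix is small (a sketch; the work item and wait
queue field names are assumed here):

    /* Sketch of aer_remove(): wait for the queue to drain, then make
     * sure the last scheduled aer_isr() instance has finished before
     * freeing rpc. */
    wait_event(rpc->wait_release, rpc->prod_idx == rpc->cons_idx);
    flush_work(&rpc->dpc_handler);          /* the added call */
    kfree(rpc);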
The helper radix_tree_iter_retry() resets next_index to the current
index. In the following radix_tree_next_slot() the current chunk size
becomes zero. This isn't checked, and it tries to dereference a NULL
pointer in 'slot'.
Tagged iteration is fine because the retry happens only at slot 0,
where the tag bitmask in iter->tags is filled with a single bit.
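Illustratively, the untagged iteration has to treat a zero chunk size
as "no more slots" instead of pre-computing "size - 1" (a sketch of
the idea, not the literal diff):

    /* Sketch: after radix_tree_iter_retry(), the chunk size
     * (next_index - index) may legitimately be 0; the pre-decrement
     * form never walks past the chunk. */
    long size = radix_tree_chunk_size(iter);

    while (--size > 0) {
            slot++;
            iter->index++;
            if (*slot)
                    return slot;            /* next populated slot */
    }
    return NULL;                            /* chunk exhausted */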
Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup") Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ohad Ben-Cohen <ohad@wizery.com> Cc: Jeremiah Mahler <jmmahler@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 42362e1f5a73d36fcaf3eb4950dfc17f633ea0d3)
The sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
anon_vma appears between the lock and the unlock. We have to check
anon_vma first, or call anon_vma_prepare() to make sure it is there.
There are only a few users of these legacy helpers; let's get rid of
them.
This patch fixes an anon_vma lock imbalance in validate_mm(). A write
lock isn't required here; a read lock is enough.
It also reorders expand_downwards()/expand_upwards():
security_mmap_addr() and the wrap-around check don't have to be under
the anon_vma lock.
Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 4b5eaa857d64ea737c3bed1eb9b3dd201dd2cecf)
When the recovery master goes down, dlm_do_local_recovery_cleanup()
only removes the $RECOVERY lock owned by the dead node but does not
clear the refmap bit. This makes the umount thread fall into an
endless loop migrating $RECOVERY to the dead node.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit ff5ecd2953a635bf70101dac550fcb01ced863e9)
The port is removed synchronously, but the connector is freed with a
delay. This causes a use-after-free which can trigger a kernel BUG
with slub_debug=FPZU. Fix this by freeing the port after the
connector.
In the ALSA timer core, active timer instances are managed in the
active_list linked list. Each element is added/removed dynamically at
timer start, stop, and in the timer interrupt. The problem is that
snd_timer_interrupt() has a thinko and leaves the element in
active_list when it's the last opened element. This eventually leads
to list corruption or a use-after-free error.
This hasn't been revealed because we used to delete the list forcibly
in snd_timer_stop() in the past. However, the recent fix avoids the
double-stop behavior (in commit [f784beb75ce8: ALSA: timer: Fix link
corruption due to double start or stop]), and this leak hits reality.
This patch fixes the link management in snd_timer_interrupt(). Now it
simply unlinks the element no matter which stream it belongs to.
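Plausibly, the relevant hunk looks like this (a sketch, not the exact
diff):

    /* Before: the last running element was left linked. */
    if (--timer->running)
            list_del(&ti->active_list);

    /* After: always unlink, regardless of which stream it is. */
    --timer->running;
    list_del_init(&ti->active_list);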
The xhci driver frees data for all devices, both usb2 and usb3, the
first time usb_remove_hcd() is called, including the td_list and
xhci_ring structures.
When usb_remove_hcd() is called a second time for the second xhci bus,
it tries to dequeue all pending urbs and touches td_list, which has
already been freed for that endpoint.
Cc: <stable@vger.kernel.org> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Tested-by: Joe Lawrence <joe.lawrence@stratus.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 8552bea30320cc866c7fb5d9a687c3e6d15fb5c1)
If the indirect_ptr bit is set on a slot, that indicates we need to redo
the lookup. Introduce a new function radix_tree_iter_retry() which
forces the loop to retry the lookup by setting 'slot' to NULL and
turning the iterator back to point at the problematic entry.
This is a pretty rare problem to hit at the moment; the lookup has to
race with a grow of the radix tree from a height of 0. The consequences
of hitting this race are that gang lookup could return a pointer to a
radix_tree_node instead of a pointer to whatever the user had inserted
in the tree.
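The helper itself is tiny; plausibly along these lines (matching the
description above):

    /* Make the caller's loop see a NULL slot and rewind the iterator
     * so the same index is looked up again on the next pass. */
    static inline void *radix_tree_iter_retry(struct radix_tree_iter *iter)
    {
            iter->next_index = iter->index;
            return NULL;
    }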
Fixes: cebbd29e1c2f ("radix-tree: rewrite gang lookup using iterator") Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ohad Ben-Cohen <ohad@wizery.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit cd3c40afce6cd0209432e6690e60c2156a27c68d)
The port subscription code uses double mutex locks for source and
destination ports, and this may become racy when wrongly set up. It
leads to a lockdep warning splat, typically triggered by a fuzzer like
syzkaller, although the actual deadlock hasn't been seen so far.
This patch simplifies the handling by reducing it to two single locks,
so that no lockdep warning will be triggered any longer.
By splitting into two actions, a still-in-progress element may be
added to one list while the other is being handled. To ignore such an
element, a new check is added in deliver_to_subscribers().
Along with it, the code to add/remove the subscribers list element was
cleaned up and refactored.
The rawmidi read and write functions manage the runtime stream status
such as runtime->appl_ptr and runtime->avail. These indicate where to
copy the new data and how many bytes have been copied (or are to be
read). The problem is that rawmidi read/write calls copy_from_user()
or copy_to_user(), and the runtime spinlock is temporarily unlocked
and relocked while copying to/from user space. Since the current code
advances and updates the runtime status after the spin unlock/relock,
the copy and the update may get out of sync, and eventually
runtime->avail might go negative when many concurrent accesses are
done. This may lead to memory corruption in the end.
To fix this race, this patch performs the status update under the same
lock, before the temporary unlock. Also, the spinlock is now taken
more widely in snd_rawmidi_kernel_read1() to protect the whole
operation more properly.
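In simplified form, the fixed ordering claims the bytes and advances
appl_ptr while still holding the lock, and drops the lock only for the
sleeping user copy (a sketch, not the exact code):

    spin_lock_irqsave(&runtime->lock, flags);
    if (count > runtime->avail)
            count = runtime->avail;
    offset = runtime->appl_ptr;
    runtime->appl_ptr = (offset + count) % runtime->buffer_size;
    runtime->avail -= count;                /* claimed under the lock */
    spin_unlock_irqrestore(&runtime->lock, flags);

    if (copy_to_user(userbuf, runtime->buffer + offset, count))
            return -EFAULT;                 /* copy done unlocked */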
Although the ALSA timer code has been hardened against races, it
still causes a use-after-free error. This is, however, rather a
corrupted linked list than actual concurrent access. Namely, when a
timer start is triggered twice, list_add_tail() is called twice, too.
This ends up with link corruption and triggers a KASAN error.
The simplest fix would be replacing list_add_tail() with
list_move_tail(), but the fundamental problem is that we don't check
for a double start/stop correctly. So the right fix here is to add the
proper checks to snd_timer_start() and snd_timer_stop() (and their
variants).
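Schematically, the start path gains a guard before linking (a sketch;
the flag and list names used by the timer core are assumed here):

    /* Refuse a second start instead of corrupting active_list with a
     * double list_add_tail(). */
    spin_lock_irqsave(&timer->lock, flags);
    if (timeri->flags & SNDRV_TIMER_IFLG_RUNNING) {
            spin_unlock_irqrestore(&timer->lock, flags);
            return -EBUSY;                  /* already started */
    }
    list_add_tail(&timeri->active_list, &timer->active_list_head);
    timeri->flags |= SNDRV_TIMER_IFLG_RUNNING;
    spin_unlock_irqrestore(&timer->lock, flags);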
This is a minor code cleanup without any functional changes:
- Kill the keep_flag argument from _snd_timer_stop(), as all callers
pass it only as false.
- Remove redundant NULL check in _snd_timer_stop().
ALSA sequencer may open/close and control ALSA timer instances
dynamically, either via sequencer events or direct ioctls. These are
done mostly asynchronously, and one path may still call a timer action
like snd_timer_start() while another is calling snd_timer_close().
Since the instance gets removed by snd_timer_close(), this may lead to
a use-after-free.
This patch tries to address such races by protecting each
snd_timer_*() call via the existing spinlock and also by avoiding
access to the timer during the close call.
There are potential deadlocks in the PCM OSS emulation code when
accessing read/write and mmap concurrently. This comes from the
infamous mmap_sem usage in copy_from/to_user().
A NULL user-space buffer can be passed even in a normal path, so it's
not good to spew a kernel warning with a stack trace each time. Just
drop the snd_BUG_ON() macro usage there.
The code within wait_event_interruptible() is called with
!TASK_RUNNING, so it mustn't call any functions that can sleep, like
mutex_lock().
Since we re-check the list_empty() in a loop after the wait,
it's safe to simply use list_empty() without locking.
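The safe shape of the wait is then (a sketch; rfkill's actual list and
lock names are assumed here):

    /* The condition must not sleep: compare list pointers directly
     * instead of calling a helper that takes a mutex. */
    mutex_unlock(&data->mtx);
    ret = wait_event_interruptible(data->read_wait,
                                   !list_empty(&data->events));
    mutex_lock(&data->mtx);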
This bug has existed forever, but was only discovered now
because all userspace implementations, including the default
'rfkill' tool, use poll() or select() to get a readable fd
before attempting to read.
During a SW scan, ieee80211_iface_work() ignores work items for all
vifs. However, after the scan completes, work is requeued only for the
STA, ADHOC and MESH iftypes.
This occasionally results in event processing getting delayed or not
processed at all for iftype AP when it coexists with a STA. This can
result in a data halt and eventually disconnection on the AP
interface.
Cc: stable@vger.kernel.org Signed-off-by: Sachin Kulkarni <Sachin.Kulkarni@imgtec.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 8448609ec0b8b36759338905a7a5f620555623d8)
There are a couple of nasty truncation bugs lurking in the pageattr
code that can be triggered when mapping EFI regions, e.g. when we pass
a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting
left by PAGE_SHIFT will truncate the resultant address to 32-bits.
Viorel-Cătălin managed to trigger this bug on his Dell machine that
provides a ~5GB EFI region which requires 1236992 pages to be mapped.
When calling populate_pud() the end of the region gets calculated
incorrectly in the following buggy expression,
    end = start + (cpa->numpages << PAGE_SHIFT);
And only 188416 pages are mapped. Next, populate_pud() gets invoked
for a second time because of the loop in __change_page_attr_set_clr(),
only this time no pages get mapped, because shifting the remaining
number of pages (1048576) by PAGE_SHIFT yields zero. At that point the
loop in __change_page_attr_set_clr() spins forever because we fail to
make progress.
Hitting this bug depends very much on the virtual address we pick to
map the large region at and how many pages we map on the initial run
through the loop. This explains why this issue was only recently hit
with the introduction of commit
a5caa209ba9c ("x86/efi: Fix boot crash by mapping EFI memmap
entries bottom-up at runtime, instead of top-down")
It's interesting to note that safe uses of cpa->numpages do exist in
the pageattr code. If instead of shifting ->numpages we multiply by
PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and
so the result is unsigned long.
To avoid surprises when users try to convert very large cpa->numpages
values to addresses, change the data type from 'int' to 'unsigned
long', thereby making it suitable for shifting by PAGE_SHIFT without
any type casting.
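In isolation the truncation is easy to demonstrate on a 64-bit build:
shifting an 'int' holding 1048576 by PAGE_SHIFT wraps to zero (a
standalone userspace illustration, not the kernel code):

    /* Standalone illustration of the truncation; not kernel code. */
    #include <stdio.h>

    #define PAGE_SHIFT 12

    int main(void)
    {
            int numpages = 1048576;               /* 4GB of 4K pages */
            unsigned long numpages_ul = 1048576;  /* type used by the fix */

            /* int shift overflows: 0 in practice */
            unsigned long bad  = numpages << PAGE_SHIFT;
            unsigned long good = numpages_ul << PAGE_SHIFT; /* 4294967296 */

            printf("bad=%lu good=%lu\n", bad, good);
            return 0;
    }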
The alternative would be to make liberal use of casting, but that is
far more likely to cause problems in the future when someone adds more
code and fails to cast properly; this bug was difficult enough to
track down in the first place.