Thomas Tai [Thu, 27 Apr 2017 17:51:48 +0000 (10:51 -0700)]
sparc64: fix cdev_put() use-after-free when unbinding an LDom
After turning on the slub_debug=P kernel option, a kernel panic happens when
unbinding an LDom, which points to memory corruption.
The corruption is caused by vlds_fops_release() freeing a memory
structure that contains a cdev. fs/file_table.c still needs the cdev
after the file is released.
The common approach to this issue is to add a kobject member
to the structure and set it as the parent of the cdev. The kobject is
then responsible for freeing the structure when its reference count
drops to zero. The reference solution is based on the following patch.
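The lifetime rule described here can be illustrated outside the kernel. The following is a minimal userspace sketch, not the driver code: fake_kobj, fake_vlds_dev and the helper functions are hypothetical stand-ins for a kobject, the driver structure and the get/put calls. It only shows that the containing structure is freed by the release hook when the last reference goes away, not by the file release path.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-in for a kobject: a reference count plus a release hook. */
struct fake_kobj {
    int refcount;
    void (*release)(struct fake_kobj *kobj);
};

/* Hypothetical driver structure. The embedded cdev is represented only
 * by the reference it would hold on its parent kobject. */
struct fake_vlds_dev {
    struct fake_kobj kobj;      /* first member, so a cast recovers the container */
    int minor;
};

int released;                   /* number of times the structure was freed */

void fake_dev_release(struct fake_kobj *kobj)
{
    free((struct fake_vlds_dev *)kobj);
    released++;
}

void fake_kobj_get(struct fake_kobj *kobj)
{
    kobj->refcount++;
}

void fake_kobj_put(struct fake_kobj *kobj)
{
    assert(kobj->refcount > 0);
    if (--kobj->refcount == 0)
        kobj->release(kobj);    /* last reference: free the container */
}

struct fake_vlds_dev *fake_dev_create(int minor)
{
    struct fake_vlds_dev *dev = calloc(1, sizeof(*dev));

    if (!dev)
        return NULL;
    dev->kobj.refcount = 1;     /* reference owned by the driver */
    dev->kobj.release = fake_dev_release;
    dev->minor = minor;
    return dev;
}
```

In this sketch the driver's release path only drops its own reference with fake_kobj_put(); the memory stays valid until the reference taken on behalf of the open file is dropped as well.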
Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-By: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Tom Saeger <tom.saeger@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
The CCB_EXEC ioctl in the DAX driver returns ENOBUFS when the user must
free completion areas before the submission can succeed. There is a
dax_err() print when this condition occurs. This print should be changed to
a dax_dbg() print since this return value can be used by the caller to
trigger freeing the completion areas, hence an error print is too verbose.
Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Vitaly Kuznetsov [Sun, 1 May 2016 02:21:33 +0000 (19:21 -0700)]
Drivers: hv: kvp: fix IP Failover
Hyper-V VMs can be replicated to other hosts, and a feature called
'Failover TCP/IP' sets a different IP for replicas. When
such a guest starts, the Hyper-V host sends it a KVP_OP_SET_IP_INFO message as soon
as we finish the negotiation procedure. The problem is that this can happen (and
actually does happen) before the userspace daemon connects, so we reply with
HV_E_FAIL to the message. As there are no retries, we fail to set the
requested IP.
Solve the issue by postponing our reply to the negotiation message until the
userspace daemon is connected. We can't wait too long as there is a
host-side timeout (circa 75 seconds) and if we fail to reply within this time
frame the whole KVP service will become inactive. The solution is not
ideal: if it takes the userspace daemon more than 60 seconds to connect,
IP Failover will still fail, but I don't see a solution with our current
separation between kernel and userspace parts.
The other two modules (VSS and FCOPY) don't require such a delay; leave them
untouched.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit 4dbfc2e68004c60edab7e8fd26784383dd3ee9bc) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
K. Y. Srinivasan [Fri, 26 Feb 2016 23:13:19 +0000 (15:13 -0800)]
Drivers: hv: util: Pass the channel information during the init call
Pass the channel information to the util drivers that need to defer
reading the channel while they are processing a request. This would address
the following issue reported by Vitaly:
Commit 3cace4a61610 ("Drivers: hv: utils: run polling callback always in
interrupt context") removed direct *_transaction.state = HVUTIL_READY
assignments from *_handle_handshake() functions, introducing the following
race: if a userspace daemon connects before we get the first non-negotiation
request from the server, hv_poll_channel() won't set the transaction state to
HVUTIL_READY as the (!channel) condition will fail; we only set it to non-NULL on
the first real request from the server.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit b9830d120cbe155863399f25eaef6aa8353e767f) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Olaf Hering [Tue, 15 Dec 2015 00:01:33 +0000 (16:01 -0800)]
Drivers: hv: utils: run polling callback always in interrupt context
All channel interrupts are bound to specific VCPUs in the guest
at the point the channel is created. While we currently invoke the
polling function on the correct CPU (the CPU to which the channel
is bound), in some cases we may run the polling function in
a non-interrupt context. This can cause an issue as the
polling function can be interrupted by the channel callback function.
Fix the issue by running the polling function on the appropriate CPU
at interrupt level. Additional details of the issue being addressed by
this patch are given below:
Currently hv_fcopy_onchannelcallback is called from interrupts and also
via the ->write function of hv_utils. Since the global variables used to
maintain state are not thread safe, the state can get out of sync.
This affects the variable state as well as the channel inbound buffer.
As suggested by KY, adjust hv_poll_channel to always run the given
callback on the CPU to which the channel is bound. This avoids the need
for locking because all the util services are single threaded and only
one transaction is active at any given point in time.
Additionally, remove the context variable, as it will always be the same as
recv_channel.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit 3cace4a616108539e2730f8dc21a636474395e0f) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
K. Y. Srinivasan [Tue, 15 Dec 2015 00:01:32 +0000 (16:01 -0800)]
Drivers: hv: util: Increase the timeout for util services
Util services such as KVP and FCOPY need assistance from daemons running
in user space. Increase the timeout so we don't prematurely terminate
the transaction in the kernel. The host sets up a 60 second timeout for
all util driver transactions and will retry the transaction if it
times out. Set the guest timeout at 30 seconds.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit c0b200cfb0403740171c7527b3ac71d03f82947a) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Vitaly Kuznetsov [Sat, 1 Aug 2015 23:08:11 +0000 (16:08 -0700)]
Drivers: hv: kvp: check kzalloc return value
kzalloc() return value check was accidentally lost in 11bc3a5fa91f:
"Drivers: hv: kvp: convert to hv_utils_transport" commit.
We don't need to reset kvp_transaction.state here as we have the
kvp_timeout_func() timeout function and in case we're in OOM situation
it is preferable to wait.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit b36fda339729a974a8838978dcdc581d8ce68fd9) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Introduce VSS_OP_REGISTER1 to support kernel replying to the negotiation
message with its own version.
Add small change to vss_handle_handshake for RH compatibility
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit cd8dc0548511efff7a97d978f989ce67a883f9a5) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Venkat Venkatsubra [Mon, 8 May 2017 11:23:13 +0000 (04:23 -0700)]
RDS/IB: 4KB receive buffers get posted by mistake on 16KB frag connections.
When connections use 4KB fragments and then move to 16KB frags
(for example during a uek2 to uek4 upgrade) we see 4KB buffers getting
posted on 16KB connections. This happens because the 4KB buffers
(buffers from the previous connection, before the move to 16KB) are
added back to the current connection's (16KB) cache.
We fix this by doing the following.
1) When the recv buffers get freed/released after either the application
is done reading it or the socket gets closed (process dies, etc.)
and RDS/IB decides to add that buffer back into the current cache,
make sure the frag size matches with that of the current connection.
2) When recv completion reports IB_WC_LOC_LEN_ERR, mark the connection state
as "buffers need to be rebuilt during reconnection". And at the time of
reconnect rebuild the cache even though the "frag size of the connection"
has not changed.
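The two steps above can be sketched as follows. This is a simplified userspace model under stated assumptions, not the RDS/IB code: struct frag, struct conn and all function names are hypothetical, and the cache is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct frag {
    size_t size;                /* size the buffer was allocated with */
};

struct conn {
    size_t frag_size;           /* fragment size of the current connection */
    size_t cached;              /* number of buffers held in the cache */
    bool rebuild_cache;         /* buffers must be rebuilt on reconnect */
};

/* Step 1: only recycle a buffer whose size matches the connection.
 * Returns false when the caller has to free the buffer instead. */
bool frag_recycle(struct conn *c, const struct frag *f)
{
    assert(c && f);
    if (f->size != c->frag_size)
        return false;
    c->cached++;
    return true;
}

/* Step 2a: a local length error marks the cache as suspect. */
void on_local_length_error(struct conn *c)
{
    c->rebuild_cache = true;
}

/* Step 2b: rebuild on reconnect if flagged, even with an unchanged size. */
void on_reconnect(struct conn *c, size_t new_frag_size)
{
    if (c->rebuild_cache || new_frag_size != c->frag_size) {
        c->cached = 0;          /* stands in for draining and refilling */
        c->rebuild_cache = false;
    }
    c->frag_size = new_frag_size;
}
```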
Ajaykumar Hotchandani [Fri, 5 May 2017 19:08:32 +0000 (12:08 -0700)]
mlx4: limit max MSIX allocations
We get more than 64 MSI-X vectors from CX3 firmware 2.35.5530 onwards.
This results in legacy mode EQ allocations after 64 EQs, which ends up
flooding 3 vectors and causing performance degradation.
With this patch, we limit max vector allocations to MAX_MSIX (64).
When the Mellanox driver can support more EQs without getting into legacy
mode, this patch should go away.
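As a rough illustration of the limit, the clamp amounts to the following. The function name is hypothetical and the real driver computes the request from firmware and CPU counts; only MAX_MSIX and its value of 64 come from the text above.

```c
#include <assert.h>

#define MAX_MSIX 64             /* upper bound named in the commit message */

/* Hypothetical helper: never ask for more vectors than MAX_MSIX. */
int limit_msix_vectors(int requested)
{
    assert(requested >= 0);
    return requested > MAX_MSIX ? MAX_MSIX : requested;
}
```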
Peter Zijlstra [Sun, 13 Dec 2015 21:11:16 +0000 (22:11 +0100)]
sched/wait: Fix the signal handling fix
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/
His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().
We cannot use current->state as was done previously, because it can
already have changed by the instruction after the store to that variable.
We must instead pass the initial state along and use that.
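A minimal sketch of the idea, assuming a simplified model in which the wait mode is an enum and the pending signal is a plain flag (both hypothetical; the kernel uses task state bits and signal_pending_state()):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum wait_mode {
    WAIT_UNINTERRUPTIBLE,
    WAIT_INTERRUPTIBLE,
};

/* The mode is the state the caller originally asked to sleep in. It is
 * passed in explicitly instead of being re-read from the task, because
 * the task state may have changed by the time the check runs. */
int bit_wait_sketch(enum wait_mode mode, bool signal_is_pending)
{
    /* schedule() would run here in the kernel */
    if (mode == WAIT_INTERRUPTIBLE && signal_is_pending)
        return -EINTR;
    return 0;
}
```

With this shape an uninterruptible wait can never return -EINTR, whatever signals are pending.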
Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers") Reported-by: Jan Stancek <jstancek@redhat.com> Reported-by: Chris Mason <clm@fb.com> Tested-by: Jan Stancek <jstancek@redhat.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Chris Mason <clm@fb.com> Reviewed-by: Paul Turner <pjt@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: tglx@linutronix.de Cc: Oleg Nesterov <oleg@redhat.com> Cc: hpa@zytor.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 25908266
(cherry picked from commit dfd01f026058a59a513f8a365b439a0681b803af) Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Eric Dumazet [Wed, 30 Dec 2015 13:51:12 +0000 (08:51 -0500)]
udp: properly support MSG_PEEK with truncated buffers
Backport of this upstream commit into stable kernels : 89c22d8c3b27 ("net: Fix skb csum races when peeking")
exposed a bug in udp stack vs MSG_PEEK support, when user provides
a buffer smaller than skb payload.
In this case,
skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
msg->msg_iov);
returns -EFAULT.
This bug does not happen in upstream kernels since Al Viro did a great
job of replacing this with:
skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);
This variant is safe vs short buffers.
For the time being, instead of reverting Herbert Xu's patch and adding back
the invalid skb->ip_summed changes, simply store the result of
udp_lib_checksum_complete() so that we avoid computing the checksum a
second time, and avoid the problematic
skb_copy_and_csum_datagram_iovec() call.
This patch can be applied on recent kernels as it avoids a double
checksumming, then backported to stable kernels as a bug fix.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 197c949e7798fbf28cfadc69d9ca0c2abbf93191)
Santosh Shilimkar [Wed, 7 Dec 2016 23:06:59 +0000 (15:06 -0800)]
net/mlx4_core: panic the system on unrecoverable errors
Mellanox catastrophic error recovery after device reset doesn't work and
in fact leaves the node unusable on the IB network since the HCA's ports
go down. At times a hard reset is needed to get the system rebooted,
which is a real problem in a production environment. Once the
network outage is detected, the unreachable node gets evicted and rebooted
on an engineered system using reboot, so a hung reboot command is
problematic. The idea is to let the kernel panic, which can recover the
system on its own with the necessary logs captured. There was a debate
on whether to use panic or machine restart, but it was agreed to use
panic instead of a silent reboot since that's the preferred option.
There is a Mellanox case open to investigate this issue. As such this
is a rare scenario, and once the issue is fixed, the fix is expected
to avoid reaching the catas error case. This panic is limited to
the error case only.
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Mukesh Kacker <mukesh.kacker@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Orabug: 25873690
This is a change taken from QU2, it is not upstream.
(cherry picked from commit 271d694b34bd22e5632eaad41ea1d9a47f1bde3a) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
In the forwarding table lookup function, xve_fwt_lookup(),
the xve_fwt lock is not released on an error condition.
This commit fixes the bug by releasing the xve_fwt lock on error.
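The usual way to avoid this class of bug is a single unlock site that every path goes through. The sketch below is a userspace model with hypothetical names (fake_lock, fwt_sketch, fwt_lookup_sketch); the lock is reduced to a counter so the balance can be checked.

```c
#include <assert.h>
#include <stddef.h>

struct fake_lock {
    int held;                   /* 1 while the lock is taken, 0 otherwise */
};

struct fwt_sketch {
    struct fake_lock lock;
    int keys[4];
    int vals[4];
    size_t count;
};

void fake_lock_acquire(struct fake_lock *l) { l->held++; }
void fake_lock_release(struct fake_lock *l) { l->held--; }

/* Returns 0 and fills *out on a hit, -1 on a miss. Both the hit and
 * the miss leave through the same unlock label. */
int fwt_lookup_sketch(struct fwt_sketch *t, int key, int *out)
{
    int ret = -1;
    size_t i;

    assert(t && out);
    fake_lock_acquire(&t->lock);
    for (i = 0; i < t->count; i++) {
        if (t->keys[i] == key) {
            *out = t->vals[i];
            ret = 0;
            goto unlock;
        }
    }
    /* error path falls through to the unlock below */
unlock:
    fake_lock_release(&t->lock);
    return ret;
}
```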
Reviewed-by: Chien Yen <chien.yen@oracle.com> Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Mukesh Kacker [Sun, 12 Feb 2017 00:42:56 +0000 (16:42 -0800)]
mlx4_core: Add func name to common error strings to locate uniquely
We add function names (and where needed line numbers) to
some repeated error strings so we can identify the failure
location uniquely for ease of debugging.
Commit "mlx4_ib: Memory leak on Dom0 with SRIOV" introduced an error
in which the CM message DREQ was silently dropped by the PF passive side
if the disconnect happened more than 5 seconds after the RTU was
received.
Orabug 25829233 documents a memory leak in the mlx4 driver
when the DomUs are destroyed while active, but this patchset does not
affect that leak. The leak is tracked by orabug 25946511.
This commit is a first step toward making the uek4 tunneling proxy equal to
upstream and thereafter fixing bugs in both places.
Commit "mlx4_ib: Memory leak on Dom0 with SRIOV" introduced an error
in which the CM message DREQ was silently dropped by the PF passive side
if the disconnect happened more than 5 seconds after the RTU was
received.
In order to cleanly revert it, this dependent commit needs to be
reverted as well.
Orabug 25829233 documents a memory leak in the mlx4 driver
when the DomUs are destroyed while active, but this patchset does not
affect that leak. The leak is tracked by orabug 25946511.
Note that this commit also included a renaming of a variable. This
will be re-introduced in a later commit.
Convert to hv_utils_transport to support both netlink and /dev/vmbus/hv_vss communication methods.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 6472f80a2eeb34b442542bccd4d600e9251d9c36) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Drivers: hv: vss: switch to using the hvutil_device_state state machine
Switch to using the hvutil_device_state state machine from using kvp_transaction.active.
State transitions are:
-> HVUTIL_DEVICE_INIT when driver loads or on device release
-> HVUTIL_READY if the handshake was successful
-> HVUTIL_HOSTMSG_RECEIVED when there is a non-negotiation message from the host
-> HVUTIL_USERSPACE_REQ after we sent the message to the userspace daemon
-> HVUTIL_USERSPACE_RECV after/if the userspace daemon has replied
-> HVUTIL_READY after we respond to the host
-> HVUTIL_DEVICE_DYING on driver unload
In hv_vss_onchannelcallback() process ICMSGTYPE_NEGOTIATE messages even when
the userspace daemon is disconnected, otherwise we can make the host think
we don't support VSS and disable the service completely.
Unfortunately there is no good way we can figure out that the userspace daemon
has died (unless we start treating all timeouts as such), add a protection
against processing new VSS_OP_REGISTER messages while being in the middle of a
transaction (HVUTIL_USERSPACE_REQ or HVUTIL_USERSPACE_RECV state).
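The transition list and the registration guard can be sketched as a small table of allowed moves. The state names come from the list above; the two helper functions are hypothetical and encode only the transitions listed in this message, so timeout handling is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

enum hvutil_device_state {
    HVUTIL_DEVICE_INIT,
    HVUTIL_READY,
    HVUTIL_HOSTMSG_RECEIVED,
    HVUTIL_USERSPACE_REQ,
    HVUTIL_USERSPACE_RECV,
    HVUTIL_DEVICE_DYING,
};

/* Hypothetical helper: true if the listed transitions allow from -> to. */
bool hvutil_transition_ok(enum hvutil_device_state from,
                          enum hvutil_device_state to)
{
    if (from == HVUTIL_DEVICE_DYING)
        return false;           /* nothing follows driver unload */
    if (to == HVUTIL_DEVICE_DYING || to == HVUTIL_DEVICE_INIT)
        return true;            /* driver unload or device release */

    switch (from) {
    case HVUTIL_DEVICE_INIT:
        return to == HVUTIL_READY;              /* handshake succeeded */
    case HVUTIL_READY:
        return to == HVUTIL_HOSTMSG_RECEIVED;   /* host sent a request */
    case HVUTIL_HOSTMSG_RECEIVED:
        return to == HVUTIL_USERSPACE_REQ;      /* forwarded to daemon */
    case HVUTIL_USERSPACE_REQ:
        return to == HVUTIL_USERSPACE_RECV;     /* daemon replied */
    case HVUTIL_USERSPACE_RECV:
        return to == HVUTIL_READY;              /* host answered */
    default:
        return false;
    }
}

/* Hypothetical helper: a new registration is refused mid-transaction. */
bool hvutil_register_ok(enum hvutil_device_state state)
{
    return state != HVUTIL_USERSPACE_REQ && state != HVUTIL_USERSPACE_RECV;
}
```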
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 086a6f68d6933d3c48b3898752cd6ca1a0e02aec) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Drivers: hv: vss: process deferred messages when we complete the transaction
In theory, the host is not supposed to issue any requests before we reply to
the previous one. In KVP we, however, support the following scenarios:
1) A message was received before userspace daemon registered;
2) A message was received while the previous one is still being processed.
In VSS we support only the former. Add support for the latter, use
hv_poll_channel() to do the job.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 38c06c29bada78c4805000bfb9b7f19cd691461b) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Convert to hv_utils_transport to support both netlink and /dev/vmbus/hv_kvp communication methods.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 11bc3a5fa91f193b3d947a4cf51e21c4aa13292d) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
There is nothing wrong with coalescing during defragmentation, it
reduces truesize overhead and simplifies things for the receiving
socket (no fraglist walk needed).
However, it also destroys geometry of the original fragments.
While that doesn't cause any breakage (we make sure to not exceed largest
original size) ip_do_fragment contains a 'fastpath' that takes advantage
of a present frag list and results in fragments that (in most cases)
match what was received.
In case it's needed, the coalescing could be done later, when we're sure
the skb is not forwarded. But discussion during NFWS resulted in
'let's just remove this for now'.
Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 14fe22e334623e451b5592193415c644005461ea)
Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues. To make sure that the two ESN
structures are the same size, check that both the overall size as reported
by xfrm_replay_state_esn_len() and the internal length are the same.
Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f843ee6dd019bcece3e74e76ad9df0155655d0df) Signed-off-by: Brian Maly <brian.maly@oracle.com>
When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer. However, it is later possible to update this replay_esn via an
XFRM_MSG_NEWAE call. There we again validate that the size of the supplied
buffer matches the existing state and if so inject the contents. We do
not at this point check that the replay_window is within the allocated
memory. This leads to out-of-bounds reads and writes triggered by
netlink packets, and in turn to memory corruption and the potential for
privilege escalation.
We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len(). This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn. It however does not check the replay_window
remains within that buffer. Add validation of the contained
replay_window.
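The added validation can be sketched as two checks on the user-supplied update. This is a reduced model, not the xfrm code: struct esn_sketch keeps only the two fields that matter here, and the assumption that the bitmap is made of 32-bit words (so the window may cover at most bmp_len * 32 bits) is stated in the code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct esn_sketch {
    uint32_t bmp_len;           /* bitmap length in 32-bit words (assumed) */
    uint32_t replay_window;     /* window size in bits */
};

/* Hypothetical check for an update to an existing replay state. */
int verify_replay_update(const struct esn_sketch *cur,
                         const struct esn_sketch *up)
{
    assert(cur && up);

    /* The update must not change the size of the allocated bitmap. */
    if (up->bmp_len != cur->bmp_len)
        return -EINVAL;

    /* The window must fit in the bitmap. The multiplication is done in
     * 64 bits so a large bmp_len cannot wrap around. */
    if (up->replay_window > (uint64_t)up->bmp_len * 32)
        return -EINVAL;

    return 0;
}
```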
Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 677e806da4d916052585301785d847c3b3e6186a) Signed-off-by: Brian Maly <brian.maly@oracle.com>
If lpfc rejects a PRLI that is sent from a target, the target will not resend
and will reject the PRLI sent from the initiator.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Joe Jin <joe.jin@oracle.com>
Alexander Popov [Tue, 28 Feb 2017 16:54:40 +0000 (19:54 +0300)]
tty: n_hdlc: get rid of racy n_hdlc.tbuf
Currently N_HDLC line discipline uses a self-made singly linked list for
data buffers and has n_hdlc.tbuf pointer for buffer retransmitting after
an error.
The commit be10eb7589337e5defbe214dae038a53dd21add8
("tty: n_hdlc add buffer flushing") introduced racy access to n_hdlc.tbuf.
After tx error concurrent flush_tx_queue() and n_hdlc_send_frames() can put
one data buffer to tx_free_buf_list twice. That causes double free in
n_hdlc_release().
Let's use standard kernel linked list and get rid of n_hdlc.tbuf:
in case of tx error put current data buffer after the head of tx_buf_list.
Jiri Slaby [Thu, 26 Nov 2015 18:28:26 +0000 (19:28 +0100)]
TTY: n_hdlc, fix lockdep false positive
The class of 4 n_hdlc buf locks is the same because a single function
n_hdlc_buf_list_init is used to init all the locks. But since
flush_tx_queue takes n_hdlc->tx_buf_list.spinlock and then calls
n_hdlc_buf_put which takes n_hdlc->tx_free_buf_list.spinlock, lockdep
emits a warning:
=============================================
[ INFO: possible recursive locking detected ]
4.3.0-25.g91e30a7-default #1 Not tainted
---------------------------------------------
a.out/1248 is trying to acquire lock:
(&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fd020>] n_hdlc_buf_put+0x20/0x60 [n_hdlc]
but task is already holding lock:
(&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fdc07>] n_hdlc_tty_ioctl+0x127/0x1d0 [n_hdlc]
other info that might help us debug this:
Possible unsafe locking scenario:
Fixes: 376c7311bdb6 ("net: add a temporary sanity check in skb_orphan()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com>
(cherry picked from commit 8b74d439e1697110c5e5c600643e823eb1dd0762) Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Paolo Abeni [Tue, 21 Feb 2017 08:33:18 +0000 (09:33 +0100)]
ip: fix IP_CHECKSUM handling
The skbs processed by ip_cmsg_recv() are not guaranteed to
be linear e.g. when sending UDP packets over loopback with
MSG_MORE.
Using csum_partial() on [potentially] the whole skb len
is dangerous; instead be on the safe side and use skb_checksum().
Thanks to the syzkaller team for detecting the issue and providing the
reproducer.
v1 -> v2:
- move the variable declaration in a tighter scope
Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ca4ef4574f1ee5252e2cd365f8f5d5bafd048f32)
Eric Dumazet [Mon, 24 Oct 2016 01:03:06 +0000 (18:03 -0700)]
udp: fix IP_CHECKSUM handling
First bug was added in commit ad6f939ab193 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now that UDP headers are pulled
before skbs are put in the receive queue.
Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sam Kumar <samanthakumar@google.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 10df8e6152c6c400a563a673e9956320bfce1871)
Willem de Bruijn [Thu, 7 Apr 2016 22:12:59 +0000 (18:12 -0400)]
udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6dd pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.
Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checksum when reading truncated data.
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 31c2e4926fe912f88388bcaa8450fcaa8f2ece47)
Alexander Popov reported that an application may trigger a BUG_ON in
sctp_wait_for_sndbuf if the socket tx buffer is full, a thread is
waiting on it to queue more data and meanwhile another thread peels off
the association being used by the first thread.
This patch replaces the BUG_ON call with a proper error handling. It
will return -EPIPE to the original sendmsg call, similarly to what would
have been done if the association wasn't found in the first place.
Acked-by: Alexander Popov <alex.popov@linux.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2dcab598484185dea7ec22219c76dcdd59e3cb90)
Darrick J. Wong [Sat, 17 Oct 2015 20:16:02 +0000 (16:16 -0400)]
ext4: store checksum seed in superblock
Allow the filesystem to store the metadata checksum seed in the
superblock and add an incompat feature to say that we're using it.
This enables tune2fs to change the UUID on a mounted metadata_csum
FS without having to (racy!) rewrite all disk metadata.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
(cherry picked from commit 8c81bd8f586c46eaf114758a78d82895a2b081c2)
Eryu Guan [Thu, 1 Dec 2016 20:08:37 +0000 (15:08 -0500)]
ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. It turns out that the kernel crashed when
calculating fs overhead (ext4_calculate_overhead()) because
the image has a very large s_first_meta_bg (debug code shows it's
842150400), and ext4 overruns the memory in count_overhead() when
setting the bitmap buffer, which is PAGE_SIZE.
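A mount-time sanity check of this kind might look like the sketch below. The message as shown here does not spell out the bound, so the comparison against the number of group descriptor blocks is an assumption, and the function and parameter names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Reject a superblock whose s_first_meta_bg points past the group
 * descriptor blocks. The field only matters when the meta_bg feature
 * is enabled, so it is ignored otherwise (assumed behavior). */
int check_first_meta_bg(uint32_t first_meta_bg, uint32_t db_count,
                        bool has_meta_bg)
{
    if (has_meta_bg && first_meta_bg >= db_count)
        return -EINVAL;
    return 0;
}
```

With the value from the report, check_first_meta_bg(842150400u, 8, true) fails the mount instead of letting the overhead calculation run off the end of the bitmap buffer.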
Darrick J. Wong [Sat, 17 Oct 2015 20:18:43 +0000 (16:18 -0400)]
ext4: clean up feature test macros with predicate functions
Create separate predicate functions to test/set/clear feature flags,
thereby replacing the wordy old macros. Furthermore, clean out the
places where we open-coded feature tests.
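One way to build such predicate functions is a generator macro that emits a test, a set and a clear helper per flag. The sketch below uses hypothetical names (sb_sketch, SKETCH_INCOMPAT_FUNCS, sb_has_csum_seed and friends) and an illustrative flag value; it shows the pattern, not the ext4 headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sb_sketch {
    uint32_t feature_incompat;  /* incompatible feature bitmask */
};

#define SKETCH_INCOMPAT_CSUM_SEED 0x2000u   /* illustrative flag value */

/* Emit has/set/clear helpers for one incompat feature flag. */
#define SKETCH_INCOMPAT_FUNCS(name, flag)                           \
static inline bool sb_has_##name(const struct sb_sketch *sb)       \
{                                                                   \
    return (sb->feature_incompat & (flag)) != 0;                    \
}                                                                   \
static inline void sb_set_##name(struct sb_sketch *sb)             \
{                                                                   \
    sb->feature_incompat |= (flag);                                 \
}                                                                   \
static inline void sb_clear_##name(struct sb_sketch *sb)           \
{                                                                   \
    sb->feature_incompat &= ~(flag);                                \
}

SKETCH_INCOMPAT_FUNCS(csum_seed, SKETCH_INCOMPAT_CSUM_SEED)
```

A call site then reads sb_has_csum_seed(sb) instead of an open-coded mask test, and the compiler type-checks the superblock argument.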
Paolo Bonzini [Thu, 12 Jan 2017 14:02:32 +0000 (15:02 +0100)]
KVM: x86: fix emulation of "MOV SS, null selector"
This is CVE-2017-2583. On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.
The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.
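The RPL condition can be illustrated with plain selector arithmetic. On x86 the low two bits of a selector are the RPL and selectors 0 through 3 are null selectors; the helper names below are hypothetical and the sketch leaves out everything else the emulator checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A null selector has index 0 and TI 0; only the RPL bits may be set. */
bool is_null_selector(uint16_t sel)
{
    return (sel & ~0x3u) == 0;
}

/* Requested privilege level: the low two bits of the selector. */
unsigned int selector_rpl(uint16_t sel)
{
    return sel & 0x3u;
}

/* Hypothetical predicate for the rule in the text: a null selector may
 * be loaded into SS only if its RPL is below 3. */
bool null_ss_load_allowed(uint16_t sel)
{
    return is_null_selector(sel) && selector_rpl(sel) < 3;
}
```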
Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.
Thomas Tai [Wed, 22 Mar 2017 17:52:11 +0000 (10:52 -0700)]
gfs2: fix slab corruption during mounting and umounting gfs file system
While mounting and unmounting a GFS2 file system, a kernel panic happens
due to slab memory corruption. The slab allocator suggests that it is
likely a double free memory corruption. The issue is traced back to
v3.9-rc6, where a patch was submitted to use kzalloc() for storing a
bitmap instead of using a local variable. The intention was to allocate
memory during mounting and to free memory during unmounting. The original
patch missed a code path which had already freed the memory, which caused
memory corruption. This patch sets the memory pointer to NULL after
the memory is freed, so that the double free memory corruption will not
happen.
gdlm_mount()
'-- set_recover_size() which uses kzalloc()
'-- if dlm does not support ops callbacks then
'--- free_recover_size() which uses kfree()
gdlm_unmount()
'-- free_recover_size() which uses kfree()
The previous patch which introduced the double free issue is
commit 57c7310b8eb9 ("GFS2: use kmalloc for lvb bitmap")
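The fix pattern is small enough to show in a userspace sketch. The structure and function names below are hypothetical and calloc()/free() stand in for kzalloc()/kfree(); like kfree(), free() is a no-op on NULL, which is what makes the second call harmless.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct lockstruct_sketch {
    unsigned long *recover_bitmap;  /* allocated at mount, freed at unmount */
};

int set_recover_size_sketch(struct lockstruct_sketch *ls, size_t words)
{
    ls->recover_bitmap = calloc(words, sizeof(*ls->recover_bitmap));
    return ls->recover_bitmap ? 0 : -1;
}

void free_recover_size_sketch(struct lockstruct_sketch *ls)
{
    free(ls->recover_bitmap);
    ls->recover_bitmap = NULL;      /* a second call now frees NULL */
}
```

The mount error path and the unmount path can then both call free_recover_size_sketch() without the second call touching already-freed memory.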
Abhi Das [Tue, 5 May 2015 16:26:04 +0000 (11:26 -0500)]
gfs2: handle NULL rgd in set_rgrp_preferences
The function set_rgrp_preferences() does not handle the (rarely
returned) NULL value from gfs2_rgrpd_get_next() and this patch
fixes that.
The fs image in question is only 150MB in size which allows for
only 1 rgrp to be created. The in-memory rb tree has only 1 node
and when gfs2_rgrpd_get_next() is called on this sole rgrp, it
returns NULL. (Default behavior is to wrap around the rb tree and
return the first node to give the illusion of a circular linked
list. In the case of only 1 rgrp, we can't have
gfs2_rgrpd_get_next() return the same rgrp (first, last, next all
point to the same rgrp)... that would cause unintended consequences
and infinite loops.)
Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com>
(cherry picked from upstream commit 959b6717175713259664950f3bba2418b038f69a) Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Peter Zijlstra [Tue, 1 Dec 2015 13:04:04 +0000 (14:04 +0100)]
sched/wait: Fix signal handling in bit wait helpers
Vladimir reported getting RCU stall warnings and bisected it back to
commit:
743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions")
That commit inadvertently reversed the calls to schedule() and signal_pending(),
thereby not handling the case where the signal is received while we sleep.
Reported-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: mark.rutland@arm.com Cc: neilb@suse.de Cc: oleg@redhat.com Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.") Link: http://lkml.kernel.org/r/20151201130404.GL3816@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 68985633bccb6066bf1803e316fbc6c1f5b796d6)
Konrad Rzeszutek Wilk [Sat, 11 Mar 2017 01:24:34 +0000 (20:24 -0500)]
xen-pcifront/hvm: Slurp up "pxm" entry and set NUMA node on PCIe device. (V5)
If the XenBus contains the "pci" (which by default
it does for both PV and HVM guests), then iterate over
all the entries there and see if there are any with "pxm-X"
key. If so those values are used to modify the NUMA locality
information for the PCIe devices that match.
Also support PCIe hotplug, in case this is done during runtime.
This patch also depends on Xen exposing the "pxm-%d" entries via
XenBus.
A bit of background:
_PXM in ACPI is used to tie all kinds of ACPI devices to the SRAT
table.
The SRAT table is a simple N-CPU array that lists APIC IDs and the NUMA nodes
and their distance from each other. There are two types - processor
affinity and memory affinity. For example one can have on a 4 CPU
machine this processor affinity:
APIC_ID | NUMA id (nid)
--------+--------------
0 | 0
2 | 0
4 | 1
6 | 1
The _PXM ties in the NUMA node (nid), so for this guest there can only be
two - 0 or 1.
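As an illustration only, the processor affinity table above can be modelled as a small lookup array. The struct and function names here are hypothetical and are not taken from the patch or from the kernel's ACPI code.

```c
#include <assert.h>
#include <stddef.h>

/* One processor affinity row: APIC ID and the NUMA node it belongs to. */
struct srat_cpu_affinity {
	unsigned int apic_id;
	int nid;
};

/* The four rows of the example table above. */
static const struct srat_cpu_affinity srat_example[] = {
	{ 0, 0 }, { 2, 0 }, { 4, 1 }, { 6, 1 },
};

/* Return the NUMA node for an APIC ID, or -1 if it is not listed. */
static int apic_id_to_nid(unsigned int apic_id)
{
	size_t i;

	for (i = 0; i < sizeof(srat_example) / sizeof(srat_example[0]); i++)
		if (srat_example[i].apic_id == apic_id)
			return srat_example[i].nid;
	return -1;
}
```

A _PXM value on a device would then be expected to name one of the node ids that appear in the second column.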
The _PXM can be slapped on most anything in the DSDT, the Processors
(kind of redundant as it is in SRAT), but most importantly for us the
PCIe devices. Except that ACPI does not enumerate all kinds of PCIe devices.
Device (PCI0)
{
Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */) // _HID: Hardware ID
..
Name (_PXM, Zero) // _PXM: Device Proximity
}
Device (PCI1)
{
Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */) // _HID: Hardware ID
Name (_CID, EisaId ("PNP0A03") /* PCI Bus */) // _CID: Compatible ID
Name (_PXM, 0x01) // _PXM: Device Proximity
}
And this nicely helps with the Linux OS (and Windows) enumerating the
PCIe bridges (the two above) _ONCE_ during bootup. Then when a device
is hotplugged under the bridges it is very clear to which NUMA domain
it belongs.
To recap, on normal hardware Linux scans _ONCE_ the DSDT during
bootup and _only_ evaluates the _PXM on bridges "PNP0A03".
ONCE.
On the QEMU guests that Xen provides we have exactly _one_ bridge.
And the PCIe devices are hotplugged as 'slots' under it.
The SR-IOV VFs we hot-plug in the guest are done during runtime (not
during bootup, that would be too easy).
This means to make this work we would need to implement in QEMU:
1) Expand piix4 emulation to have bridges, PCIe bridges at bootup.
And bridges also expose the "window" of what the size of the MMIO
region is behind it (and the PCIe devices would fit in there).
2) Create up to one of these PCI bridges per NUMA node, each with its
_PXM information.
3) Then during PCI hotplug, decide which bridge to use based on the
NUMA locality.
That is hard. Step 1) is especially difficult as we have no idea
how big the MMIO BAR of the plugged-in device will be!
Fortunately Intel resolved this with Intel VT-d. It has a hotplug
capability so you can insert a brand new PCIe bridge at any point.
This is how Thunderbolt works in essence.
This would mean that in QEMU we would need to:
4) Emulate in QEMU an IOMMU VT-d with PCI hotplug support.
Recognizing that 1-4 may take some time, and would need to be
done upstream first, I decided to take a bit of a shortcut.
Mainly that:
1) This only needs to work for ExaData which uses our kernel (UEK)
2) We already expose some of this information on XenBus.
3) Once upstream is done this can be easily 'dropped'.
The 'vdevfn' is the slot:function value. 28 (hex) is 00:05.0 and 30 (hex)
is 00:06.0 and that corresponds to (inside of the guest):
-bash-4.1# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Class ff80: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 USB Controller: NEC Corporation Device 0194 (rev 03)
00:06.0 USB Controller: NEC Corporation Device 0194 (rev 04)
This 'vdevfn' is created by QEMU when the device is hotplugged
(or at bootup time).
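A short sketch of how a 'vdevfn' value splits into slot and function, assuming the values 28 and 30 quoted above are hexadecimal and use the usual PCI devfn layout (slot in bits 7:3, function in bits 2:0). The helper names are made up for this example; the kernel's own macros for this are PCI_SLOT() and PCI_FUNC().

```c
#include <assert.h>

/* Slot number: upper five bits of the devfn byte. */
static unsigned int devfn_slot(unsigned int devfn)
{
	return (devfn >> 3) & 0x1f;
}

/* Function number: lower three bits of the devfn byte. */
static unsigned int devfn_func(unsigned int devfn)
{
	return devfn & 0x07;
}
```

Under that assumption, 0x28 decodes to slot 5, function 0, and 0x30 to slot 6, function 0, which lines up with the two NEC USB controllers in the lspci listing.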
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
v1: Fixed all the checkpatch.pl issues
Made it respect the node_online() in case the backend provided
a larger value than there are NUMA nodes.
v2: Fixed per Boris's reviews.
v3: Added a mechanism to prune the list when devices are removed.
v4: s/l/len
Added space after 'len' in declaration.
Fixed comments
Added Reviewed-by.
v5: Added Boris's Reviewed-by
There were some problems in the fmr_pool code: it was either missing lock
protection or using the wrong lock when allocating/freeing/looking up resources
in the FMR pool.
After covering all of the above issues, it turns out that everywhere we need lock
protection we need both the pool_lock and used_pool_lock. So this patch also
removes the used_pool_lock, keeps the pool_lock, and makes the latter serialize
all the accesses.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Avinash Repaka <avinash.repaka@oracle.com>
Prior to commit c0371da6047a ("put iov_iter into msghdr") in v3.19, there
was no check that the iovec contained enough bytes for an ICMP header,
and the read loop would walk across neighboring stack contents. Since the
iov_iter conversion, bad arguments are noticed, but the returned error is
EFAULT. Returning EINVAL is a clearer error and also solves the problem
prior to v3.19.
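The shape of such a check can be sketched as below. This is an illustrative userspace model, not the patched function: `struct icmp_hdr_model` is a local stand-in for the 8-byte ICMP header and `ping_check_len()` is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the 8-byte ICMP header; not the kernel's struct. */
struct icmp_hdr_model {
	uint8_t  type;
	uint8_t  code;
	uint16_t checksum;
	uint32_t rest;
};

/* Reject payloads too short to hold a header before reading from them. */
static int ping_check_len(size_t len)
{
	if (len < sizeof(struct icmp_hdr_model))
		return -EINVAL;
	return 0;
}
```

Doing the length test up front means a short buffer is reported as an invalid argument rather than surfacing later as a fault during the copy.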
Reported-by: Qidan He <i@flanker017.me> Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0eab121ef8750a5c8637d51534d5e9143fb0633f) Signed-off-by: Brian Maly <brian.maly@oracle.com>
The user can control the size of the next command passed along, but the
value passed to the ioctl isn't checked against the usable max command
size.
Cc: <stable@vger.kernel.org> Signed-off-by: Peter Chang <dpf@google.com> Acked-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
The commit 90c311b0eeea ("xen-netfront: Fix Rx stall during network
stress and OOM") caused the refill timer to be triggered almost on
all invocations of xennet_alloc_rx_buffers for certain workloads.
This reworks the fix by reverting to the old behaviour and taking into
consideration the skb allocation failure. Refill timer is now triggered
on insufficient requests or skb allocation failure.
Signed-off-by: Vineeth Remanan Pillai <vineethp@amazon.com> Fixes: 90c311b0eeea (xen-netfront: Fix Rx stall during network stress and OOM) Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Backport from upstream 538d92912d3190a1dd809233a0d57277459f37b2
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-By: Joe Jin <joe.jin@oracle.com>
During an OOM scenario, request slots could not be created as skb
allocation fails. So the netback cannot pass in packets and netfront
wrongly assumes that there is no more work to be done and it disables
polling. This causes Rx to stall.
The issue is with the retry logic which schedules the timer if the
created slots are less than NET_RX_SLOTS_MIN. The count of new request
slots to be pushed are calculated as a difference between new req_prod
and rsp_cons which could be more than the actual slots, if there are
unconsumed responses.
The fix is to calculate the count of newly created slots as the
difference between new req_prod and old req_prod.
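The difference between the two ways of counting can be shown with plain ring-index arithmetic. This is a sketch under the assumption that the indices are free-running unsigned 32-bit counters; the function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ring_idx_t; /* free-running index, wraps modulo 2^32 */

/* Old accounting: also counts slots whose responses are not yet consumed. */
static ring_idx_t slots_outstanding(ring_idx_t new_req_prod, ring_idx_t rsp_cons)
{
	return (ring_idx_t)(new_req_prod - rsp_cons);
}

/* Fixed accounting: only the slots created by this refill. */
static ring_idx_t slots_created(ring_idx_t new_req_prod, ring_idx_t old_req_prod)
{
	return (ring_idx_t)(new_req_prod - old_req_prod);
}
```

For example, if every skb allocation fails (req_prod stays at 100) while 40 responses are still unconsumed (rsp_cons is 60), the old count is 40 and looks healthy, whereas the fixed count is 0 and can trigger the retry.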
Signed-off-by: Vineeth Remanan Pillai <vineethp@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Backport from upstream 90c311b0eeead647b708a723dbdde1eda3dcad05
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-By: Joe Jin <joe.jin@oracle.com>
The problem is that shmat() calls do_mmap_pgoff() with MAP_FIXED, and
the address rounded down to 0. For the regular mmap case, the
protection mentioned above is that the kernel gets to generate the
address -- arch_get_unmapped_area() will always check for MAP_FIXED and
return that address. So by the time we do security_mmap_addr(0) things
get funky for shmat().
The testcase itself shows that while a regular user crashes, root will
not have a problem attaching a nil-page. There are two possible fixes
to this. The first, and which this patch does, is to simply allow root
to crash as well -- this is also regular mmap behavior, ie when hacking
up the testcase and adding mmap(... |MAP_FIXED). While this approach
is the safer option, the second alternative is to ignore SHM_RND if the
rounded address is 0, thus only having MAP_SHARED flags. This makes the
behavior of shmat() identical to the mmap() case. The downside of this
is obviously user visible, but does make sense in that it maintains
semantics after the round-down wrt 0 address and mmap.
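The round-down itself is simple mask arithmetic, sketched here for illustration. `SHMLBA_MODEL` is an assumed value (the real SHMLBA is architecture specific) and `shm_round_down()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

#define SHMLBA_MODEL 4096UL /* assumed; the real SHMLBA depends on the arch */

/* SHM_RND rounds the requested attach address down to a SHMLBA multiple. */
static uintptr_t shm_round_down(uintptr_t addr)
{
	return addr & ~(uintptr_t)(SHMLBA_MODEL - 1);
}
```

Any non-zero address below SHMLBA therefore becomes 0 after rounding, which is how a caller that never asked for address 0 ends up with a fixed mapping request at the nil page.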
Both damn things interpret userland pointers embedded into the payload;
worse, they are actually traversing those. Leaving aside the bad
API design, this is very much _not_ safe to call with KERNEL_DS.
Bail out early if that happens.
Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 128394eff343fc6d2f32172f03e24829539c5835) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Fixes: 5f74f82ea34c ("net:Add sysctl_max_skb_frags") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Cc: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ac9e70b17ecd7c6e933ff2eaf7ab37429e71bf4d)
This problem can occur in the following situation:
open()
- pread()
- .seq_start()
- iter = kmalloc() // succeeds
- seqf->private = iter
- .seq_stop()
- kfree(seqf->private)
- pread()
- .seq_start()
- iter = kmalloc() // fails
- .seq_stop()
- class_dev_iter_exit(seqf->private) // boom! old pointer
As the comment in disk_seqf_stop() says, stop is called even if start
failed, so we need to reinitialise the private pointer to NULL when seq
iteration stops.
An alternative would be to set the private pointer to NULL when the
kmalloc() in disk_seqf_start() fails.
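The sequence above can be replayed in a small userspace model. Everything here is a stand-in written for illustration (`struct seq_model`, `model_start()`, `model_stop()` and the `frees` counter are not kernel names); it only demonstrates why clearing the pointer in stop makes a later stop harmless.

```c
#include <assert.h>
#include <stdlib.h>

struct seq_model { void *private; }; /* stand-in for struct seq_file */

static int frees; /* counts releases so a double free would be visible */

/* Model of .seq_start(): the allocation may fail. */
static int model_start(struct seq_model *seqf, int alloc_fails)
{
	void *iter = alloc_fails ? NULL : malloc(16);

	if (!iter)
		return -1; /* note: seqf->private is left untouched */
	seqf->private = iter;
	return 0;
}

/* Model of .seq_stop() with the fix: release, then forget the pointer. */
static void model_stop(struct seq_model *seqf)
{
	if (seqf->private) {
		free(seqf->private);
		frees++;
		seqf->private = NULL; /* a later stop now sees NULL */
	}
}
```

Without the assignment to NULL, the second `model_stop()` after a failed start would free the stale pointer again.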
Cc: stable@vger.kernel.org Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 77da160530dd1dc94f6ae15a981f24e5f0021e84) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Currently XFS calls file_remove_privs() without holding i_mutex. This is
wrong because that function can end up messing with file permissions and
file capabilities stored in xattrs for which we need i_mutex held.
Fix the problem by grabbing iolock exclusively when we will need to
change anything in permissions / xattrs.
Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533 Signed-off-by: darrick.wong@oracle.com
Comment in include/linux/security.h says that ->inode_killpriv() should
be called when setuid bit is being removed and that similar security
labels (in fact this applies only to file capabilities) should be
removed at this time as well. However we don't call ->inode_killpriv()
when we remove suid bit on truncate.
We fix the problem by calling ->inode_need_killpriv() and subsequently
->inode_killpriv() on truncate the same way as we do it on file write.
After this patch there's only one user of should_remove_suid() - ocfs2 -
and indeed it's buggy because it doesn't call ->inode_killpriv() on
write. However fixing it is difficult because of special locking
constraints.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533 Signed-off-by: darrick.wong@oracle.com
Provide function telling whether file_remove_privs() will do anything.
Currently we only have should_remove_suid() and that does something
slightly different.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533 Signed-off-by: darrick.wong@oracle.com
file_remove_suid() is a misnomer since it removes also file capabilities
stored in xattrs and sets S_NOSEC flag. Also should_remove_suid() tells
something else than whether file_remove_suid() call is necessary which
leads to bugs.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 24803533 Signed-off-by: darrick.wong@oracle.com
IB/ipoib: Expose acl_enable sysfs file as read only
This file can be used to determine if ipoib supports IB-ACL.
In debug mode all sysfs files are exposed in full mode.
In non-debug mode only acl_enable is exposed but in read only mode.
Chuck Anderson [Sun, 28 May 2017 02:56:21 +0000 (19:56 -0700)]
Merge branch 'topic/uek-4.1/dtrace' up to bug 26137220 into uek/uek-next
* topic/uek-4.1/dtrace:
ctf: prevent modules on the dedup blacklist from sharing any types at all
ctf: emit bitfields in in-memory order
ctf: bitfield support
ctf: emit file-scope static variables
ctf: speed up the dwarf2ctf duplicate detector some more
ctf: strdup() -> xstrdup()
ctf: speed up the dwarf2ctf duplicate detector
ctf: add module parameter to simple_dwfl_new() and adjust both callers
ctf: fix the size of int and avoid duplicating it
ctf: allow overriding of DIE attributes: use it for parent bias
DTrace tcp/udp provider probes
dtrace: define DTRACE_PROBE_ENABLED to 0 when !CONFIG_DTRACE
dtrace: ensure limit is enforced even when pcs is NULL
dtrace: make x86_64 FBT return probe detection less restrictive
dtrace: support passing offset as arg0 to FBT return probes
dtrace: make FBT entry probe detection less restrictive on x86_64
dtrace: adjust FBT entry probe dection for OL7
Pablo Neira Ayuso [Wed, 9 Dec 2015 21:06:59 +0000 (22:06 +0100)]
netfilter: nf_dup: add missing dependencies with NF_CONNTRACK
CONFIG_NF_CONNTRACK=m
CONFIG_NF_DUP_IPV4=y
results in:
net/built-in.o: In function `nf_dup_ipv4':
>> (.text+0xd434f): undefined reference to `nf_conntrack_untracked'
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit d3340b79ec8222d20453b1e7f261b017d1d09dc9)
Pablo Neira Ayuso [Sun, 31 May 2015 16:04:11 +0000 (18:04 +0200)]
netfilter: nf_tables: add nft_dup expression
This new expression uses the nf_dup engine to clone packets to a given gateway.
Unlike xt_TEE, we use an index to indicate output interface which should be
fine at this stage.
Moreover, change to the preemption-safe this_cpu_read(nf_skb_duplicated) from
nf_dup_ipv{4,6} to silence a lockdep splat.
Based on the original tee expression from Arturo Borrero Gonzalez, although
this patch has diverged quite a bit from this initial effort due to the
change to support maps.
Pablo Neira Ayuso [Sun, 31 May 2015 15:54:44 +0000 (17:54 +0200)]
netfilter: factor out packet duplication for IPv4/IPv6
Extracted from the xtables TEE target. This creates two new modules for IPv4
and IPv6 that are shared between the TEE target and the new nf_tables dup
expressions.
This prepares for a TEE like expression in nftables.
We want to ensure only one duplicate is sent, so both will
use the same percpu variable to detect duplication.
The other use case is detection of recursive calls to xtables, but since
we don't want a dependency from nft to the xtables core it's put into core.c
instead of the x_tables core.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit e7c8899f3e6f2830136cf6e115c4a55ce7a3920a)
Martin KaFai Lau [Sat, 23 May 2015 03:56:02 +0000 (20:56 -0700)]
ipv6: Set FLOWI_FLAG_KNOWN_NH at flowi6_flags
The neighbor look-up used to depend on the rt6i_gateway (if
there is a gateway) or the rt6i_dst (if it is a RTF_CACHE clone)
as the nexthop address. Note that rt6i_dst is set to fl6->daddr
for the RTF_CACHE clone where fl6->daddr is the one used to do
the route look-up.
Now, we only create RTF_CACHE clone after encountering exception.
When doing the neighbor look-up with a route that is neither a gateway
nor a RTF_CACHE clone, the daddr in skb will be used as the nexthop.
In some cases, the daddr in skb is not the one used to do
the route look-up. One example is in ip_vs_dr_xmit_v6() where the
real nexthop server address is different from the one in the skb.
This patch is going to follow the IPv4 approach and ask the
ip6_pol_route() callers to set the FLOWI_FLAG_KNOWN_NH properly.
In the next patch, ip6_pol_route() will honor the FLOWI_FLAG_KNOWN_NH
and create a RTF_CACHE clone.
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Julian Anastasov <ja@ssi.bg> Tested-by: Julian Anastasov <ja@ssi.bg> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 48e8aa6e3137692d38f20e8bfff100e408c6bc53)
Jan Kara [Sun, 28 Feb 2016 21:36:38 +0000 (08:36 +1100)]
ext4: Fix data exposure after failed AIO DIO
When AIO DIO fails e.g. due to IO error, we must not convert unwritten
extents as that will expose uninitialized data. Handle this case
by clearing unwritten flag from io_end in case of error and thus
preventing extent conversion.
Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry-picked commit from 74c66bcb7eda551f3b8588659c58fe29184af903)
Christoph Hellwig [Mon, 8 Feb 2016 03:40:51 +0000 (14:40 +1100)]
xfs: fold xfs_vm_do_dio into xfs_vm_direct_IO
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry-picked commit from c19b104a67b3bb1ac48275a8a1c9df666e676c25)
Christoph Hellwig [Mon, 8 Feb 2016 03:40:51 +0000 (14:40 +1100)]
xfs: don't use ioends for direct write completions
We only need to communicate two bits of information to the direct I/O
completion handler:
(1) do we need to convert any unwritten extents in the range
(2) do we need to check if we need to update the inode size based
on the range passed to the completion handler
We can use the private data passed to the get_block handler and the
completion handler as a simple bitmask to communicate this information
instead of the current complicated infrastructure reusing the ioends
from the buffer I/O path, and thus avoiding a memory allocation and
a context switch for any non-trivial direct write. As a nice side
effect we also decouple the direct I/O path implementation from that
of the buffered I/O path.
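One way to carry two bits through a pointer-sized private slot is sketched below. The flag names and helpers are hypothetical and chosen for this example; they are not claimed to match the identifiers used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names for this sketch. */
#define DIO_FLAG_UNWRITTEN (1u << 0) /* range has unwritten extents to convert */
#define DIO_FLAG_APPEND    (1u << 1) /* range may require an inode size update */

/* Mapping side: record a flag in the pointer-sized private slot. */
static void dio_set_flag(void **private, unsigned int flag)
{
	*private = (void *)((uintptr_t)*private | flag);
}

/* Completion side: read the flags back without any allocation. */
static bool dio_test_flag(void *private, unsigned int flag)
{
	return ((uintptr_t)private & flag) != 0;
}
```

Because the slot is never dereferenced, nothing has to be allocated or freed on the direct write path; the completion handler just tests the bits.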
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
(cherry-picked commit from 273dda76f757108bc2b29d30a9595b6dd3bdf3a1)
Orabug: 24393811
Conflicts:
Fixed a merge conflict in xfs_trace.h that arose due to the absence of
xfs_zero_eof in UEK4.
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Christoph Hellwig [Fri, 28 Apr 2017 01:27:09 +0000 (18:27 -0700)]
direct-io: always call ->end_io if non-NULL
This way we can pass back errors to the file system, and allow for
cleanup required for all direct I/O invocations.
Also allow the ->end_io handlers to return errors on their own, so that
I/O completion errors can be passed on to the callers.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry-picked commit from 187372a3b9faff68ed61c291d0135e6739e0dbdf)
Orabug: 24393811
Conflicts:
The change in the return type of dio_iodone_t
broke the KABI. Hence, the original return type is kept
under the __GENKSYMS__ flag.
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Under certain situations, an incremental send operation can fail due to a
premature attempt to create a new top level inode (a direct child of the
subvolume/snapshot root) whose name collides with another inode that was
removed from the send snapshot.
Consider the following example scenario.
Parent snapshot:
. (ino 256, gen 8)
|---- a1/ (ino 257, gen 9)
|---- a2/ (ino 258, gen 9)
Send snapshot:
. (ino 256, gen 3)
|---- a2/ (ino 257, gen 7)
In this scenario, when receiving the incremental send stream, the btrfs
receive command fails like this (ran in verbose mode, -vv argument):
rmdir a1
mkfile o257-7-0
rename o257-7-0 -> a2
ERROR: rename o257-7-0 -> a2 failed: Is a directory
What happens when computing the incremental send stream is:
1) An operation to remove the directory with inode number 257 and
generation 9 is issued.
2) An operation to create the inode with number 257 and generation 7 is
issued. This creates the inode with an orphanized name of "o257-7-0".
3) An operation to rename the new inode 257 to its final name, "a2", is
issued. This is incorrect because inode 258, which has the same name
and is a child of the same parent (root inode 256), was not yet
processed and therefore no rmdir operation for it was yet issued.
The rename operation is issued because we fail to detect that the
name of the new inode 257 collides with inode 258, because their
parent, a subvolume/snapshot root (inode 256) has a different
generation in both snapshots.
So fix this by ignoring the generation value of a parent directory that
matches a root inode (number 256) when we are checking if the name of the
inode currently being processed collides with the name of some other
inode that was not yet processed.
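The comparison rule described here can be written as a small predicate. This is a sketch only: `same_parent_dir()` is a hypothetical helper, and the constant 256 is the root inode number stated in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ROOT_INO 256ULL /* inode number of a subvolume/snapshot root */

/* Do two (inode number, generation) pairs refer to the same parent dir? */
static bool same_parent_dir(uint64_t ino_a, uint64_t gen_a,
			    uint64_t ino_b, uint64_t gen_b)
{
	if (ino_a != ino_b)
		return false;
	/* The root's generation differs between snapshots; ignore it there. */
	if (ino_a == ROOT_INO)
		return true;
	return gen_a == gen_b;
}
```

With the example snapshots, the roots (256, gen 8) and (256, gen 3) now count as the same parent, so the collision between the new inode 257 and the not yet processed inode 258 can be detected.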
We can achieve this scenario of different inodes with the same number but
different generation values either by mounting a filesystem with the inode
cache option (-o inode_cache) or by creating and sending snapshots across
different filesystems, like in the following example:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/a2
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs receive /mnt -f /tmp/1.snap
# Take note that once the filesystem is created, its current
# generation has value 7 so the inode from the second snapshot has
# a generation value of 7. And after receiving the first snapshot
# the filesystem is at a generation value of 10, because the call to
# create the second snapshot bumps the generation to 8 (the snapshot
# creation ioctl does a transaction commit), the receive command calls
# the snapshot creation ioctl to create the first snapshot, which bumps
# the filesystem's generation to 9, and finally when the receive
# operation finishes it calls an ioctl to transition the first snapshot
# (snap1) from RW mode to RO mode, which does another transaction commit
# and bumps the filesystem's generation to 10.
$ rm -f /tmp/1.snap
$ btrfs send /mnt/snap1 -f /tmp/1.snap
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.snap
$ umount /mnt
$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ btrfs receive /mnt /tmp/1.snap
# Receive of snapshot snap2 used to fail.
$ btrfs receive /mnt /tmp/2.snap
Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com>
[Rewrote changelog to be more precise and clear] Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from commit 4dd9920d991745c4a16f53a8f615f706fbe4b3f7) Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com> Signed-off-by: Shan Hai <shan.hai@oracle.com>
All of the bridge's 64-bit resources have the pref bit set, but the device resource
does not have pref set, so we cannot find a parent for the device resource,
as we cannot put non-pref MMIO under pref MMIO.
According to the PCIe spec errata
https://www.pcisig.com/specifications/pciexpress/base2/PCIe_Base_r2.1_Errata_08Jun10.pdf
page 13, in some cases it is OK to mark some as pref.
Mark it if the entire path from the host to the adapter is over PCI Express.
Set the pref compatible bit for claim/sizing/assign for 64-bit mem resources
on that PCIe device.
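The "entire path is PCI Express" condition amounts to walking up the bridge chain. A minimal sketch follows, using a simplified, hypothetical `struct dev_model` instead of the kernel's struct pci_dev.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device node; not the kernel's struct pci_dev. */
struct dev_model {
	bool is_pcie;
	const struct dev_model *parent; /* upstream bridge, NULL at the host */
};

/* True only if the device and every bridge above it are PCI Express. */
static bool whole_path_is_pcie(const struct dev_model *dev)
{
	for (; dev; dev = dev->parent)
		if (!dev->is_pcie)
			return false;
	return true;
}
```

A single conventional PCI bridge anywhere between the host and the adapter makes the predicate false, in which case the resource would not be treated as pref compatible.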
-v2: set pref for mmio 64 when whole path is PCI Express, according to David Miller.
-v3: don't set pref directly, change to UNDER_PREF, and set PREF before
sizing and assign resource, and clear PREF afterwards. requested by BenH.
-v4: use on_all_pcie_path device flag instead.
-v6: update after pci_find_bus_resource() change
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 44e098ad7113adb109fa2d95a29fe2ba9a846efd) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 174fdc115fe527cdcf6b4b99f1e8e6ca4f8062ce) Signed-off-by: Allen Pais <allen.pais@oracle.com>
It turns out that pci_resource_compatible()/pci_up_path_over_pref_mem64()
just check resource with bridge pref mmio register idx 15, and we have put
resource to use mmio register idx 14 during of_scan_pci_bridge()
as the bridge does not have mmio resource.
We already fixed pci_up_path_over_pref_mem64() to check all bus resources.
At the same time, this patch makes the resources follow a consistent sequence,
as on other arches or directly from pci_read_bridge_bases(),
even when non-pref MMIO is missing or reported out of order by firmware.
Just hold i = 1 for non-pref MMIO, and i = 2 for pref MMIO.
Signed-off-by: Yinghai Lu <yinghai@kernel.org> Tested-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: sparclinux@vger.kernel.org
Orabug: 22855133
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 71c1871c22891102da6303d09b46184361d5f853) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 47b22604dceb59dfc1018ebd0b8def065daa27db) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Yinghai Lu [Sat, 18 Jun 2016 02:24:51 +0000 (19:24 -0700)]
sparc/PCI: Reserve legacy mmio after PCI mmio
On one system we found a bunch of resource claim failures from PCI devices.
pci_sun4v f02b894c: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io 0x2007e00000000-0x2007e0fffffff] (bus address [0x0000-0xfffffff])
pci_bus 0000:00: root bus resource [mem 0x2000000000000-0x200007effffff] (bus address [0x00000000-0x7effffff])
pci_bus 0000:00: root bus resource [mem 0x2000100000000-0x20007ffffffff] (bus address [0x100000000-0x7ffffffff])
...
PCI: Claiming 0000:00:02.0: Resource 14: 0002000000000000..00020000004fffff [200]
pci 0000:00:02.0: can't claim BAR 14 [mem 0x2000000000000-0x20000004fffff]: address conflict with Video RAM area [??? 0x20000000a0000-0x20000000bffff flags 0x80000000]
pci 0000:02:00.0: can't claim BAR 0 [mem 0x2000000000000-0x20000000fffff]: no compatible bridge window
PCI: Claiming 0000:02:00.0: Resource 3: 0002000000100000..0002000000103fff [200]
pci 0000:02:00.0: can't claim BAR 3 [mem 0x2000000100000-0x2000000103fff]: no compatible bridge window
PCI: Claiming 0000:02:00.1: Resource 0: 0002000000200000..00020000002fffff [200]
pci 0000:02:00.1: can't claim BAR 0 [mem 0x2000000200000-0x20000002fffff]: no compatible bridge window
PCI: Claiming 0000:02:00.1: Resource 3: 0002000000104000..0002000000107fff [200]
pci 0000:02:00.1: can't claim BAR 3 [mem 0x2000000104000-0x2000000107fff]: no compatible bridge window
PCI: Claiming 0000:02:00.2: Resource 0: 0002000000300000..00020000003fffff [200]
pci 0000:02:00.2: can't claim BAR 0 [mem 0x2000000300000-0x20000003fffff]: no compatible bridge window
PCI: Claiming 0000:02:00.2: Resource 3: 0002000000108000..000200000010bfff [200]
pci 0000:02:00.2: can't claim BAR 3 [mem 0x2000000108000-0x200000010bfff]: no compatible bridge window
PCI: Claiming 0000:02:00.3: Resource 0: 0002000000400000..00020000004fffff [200]
pci 0000:02:00.3: can't claim BAR 0 [mem 0x2000000400000-0x20000004fffff]: no compatible bridge window
PCI: Claiming 0000:02:00.3: Resource 3: 000200000010c000..000200000010ffff [200]
pci 0000:02:00.3: can't claim BAR 3 [mem 0x200000010c000-0x200000010ffff]: no compatible bridge window
The bridge 00:02.0 resource does not get reserved as Video RAM takes the position early,
and all the following child resource reservations fail.
Move the Video RAM area reservation down to after the PCI MMIO gets reserved,
so we leave those regions for the PCI driver to use.
-v5: merge simplify one and use pcibios_bus_to_resource()
-v6: use pci_find_bus_resource()
Signed-off-by: Yinghai Lu <yinghai@kernel.org> Tested-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: sparclinux@vger.kernel.org
Orabug: 22855133
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit bb2c8f32be84cbeec0dd585481c637657e73b05e) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Yinghai Lu [Sat, 18 Jun 2016 02:24:50 +0000 (19:24 -0700)]
PCI: Add pci_find_bus_resource()
Add pci_find_bus_resource() to return bus resource for input resource.
In some cases, we may only have a bus instead of a dev.
It is the same as pci_find_parent_resource(), but takes a bus as input.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Orabug: 22855133
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 7f93fc77f440039967ee5a057d49644526a7879f) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Yinghai Lu [Sat, 18 Jun 2016 02:24:49 +0000 (19:24 -0700)]
sparc/PCI: Use correct offset for bus address to resource
After we added 64bit mmio parsing, we got some "no compatible bridge window"
warnings on another new model that supports 64bit resources.
It turns out that we cannot use mem_space.start as the 64bit mem space
offset, i.e. mem_space.start != offset.
Use child_phys_addr to calculate the exact offset and record the offset in
pbm.
After the patch we get the correct offset.
/pci@305: PCI IO [io 0x2007e00000000-0x2007e0fffffff] offset 2007e00000000
/pci@305: PCI MEM [mem 0x2000000100000-0x200007effffff] offset 2000000000000
/pci@305: PCI MEM64 [mem 0x2000100000000-0x2000dffffffff] offset 2000000000000
...
pci_sun4v f02ae7f8: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io 0x2007e00000000-0x2007e0fffffff] (bus address [0x0000-0xfffffff])
pci_bus 0000:00: root bus resource [mem 0x2000000100000-0x200007effffff] (bus address [0x00100000-0x7effffff])
pci_bus 0000:00: root bus resource [mem 0x2000100000000-0x2000dffffffff] (bus address [0x100000000-0xdffffffff])
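The relation between the resource ranges and the bus address ranges in this log is a fixed per-window offset. The sketch below only restates that arithmetic with the values printed above; the helper names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets printed for /pci@305 in the log above. */
#define IO_OFFSET  0x2007e00000000ULL
#define MEM_OFFSET 0x2000000000000ULL

/* CPU (resource) address = PCI bus address + per-window offset. */
static uint64_t bus_to_resource(uint64_t bus_addr, uint64_t offset)
{
	return bus_addr + offset;
}

/* PCI bus address = CPU (resource) address - per-window offset. */
static uint64_t resource_to_bus(uint64_t res_addr, uint64_t offset)
{
	return res_addr - offset;
}
```

Note that the MEM window starts at 0x2000000100000 while its offset is 0x2000000000000, which is the mem_space.start != offset case: using the window start as the offset would shift every bus address by 0x100000.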
-v3: put back mem64_offset, as we found T4 has mem_offset != mem64_offset
check overlapping between mem64_space and mem_space.
-v7: after new pci_mmap_page_range patches.
-v8: remove change in pci_resource_to_user()
Signed-off-by: Yinghai Lu <yinghai@kernel.org> Tested-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: sparclinux@vger.kernel.org
Orabug: 22855133
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit f296714da83b75783997f8dcfe2a9021ef8fedde) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit c34a8c61ea58d516fb20c6ec4fdf338f96fcfeef) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Yinghai Lu [Mon, 25 Jul 2016 22:07:59 +0000 (16:07 -0600)]
PCI: Let pci_mmap_page_range() take resource address
The original pci_mmap_page_range() takes the PCI BAR value, aka usr_address.
Bjorn found out that it would be much simpler to pass the resource address
directly and avoid those extra __pci_mmap_make_offset calls.
In this patch:
1. in proc path: proc_bus_pci_mmap, try to convert back to resource
before calling pci_mmap_page_range
2. in sysfs path: pci_mmap_resource will just offset with resource start.
3. all pci_mmap_page_range will have vma->vm_pgoff within the resource
range instead of BAR value.
4. skip calling __pci_mmap_make_offset, as the checking is done
in pci_mmap_fits().
-v2: add pci_user_to_resource and remove __pci_mmap_make_offset
-v3: pass resource pointer with pci_mmap_page_range()
-v4: put __pci_mmap_make_offset() removing to following patch
separate /sys io access alignment checking to another patch
updated after Bjorn's pci_resource_to_user() changes.
-v5: update after fix for pci_mmap with proc path according to
Bjorn.
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit d6ccd78899c966eec1bac726f96f6cbed5b9960e) Signed-off-by: Allen Pais <allen.pais@oracle.com>
vma->vm_pgoff in the above code segment is the user/BAR value >> PAGE_SHIFT.
pci_start is resource->start >> PAGE_SHIFT.
For sparc, the resource start is different from the BAR start, aka the PCI bus address.
An offset must be added to the PCI bus address to get the resource start.
So that commit breaks every arch where the exposed value is the BAR/user value
and needs to be offset to the resource address.
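A simplified userspace model of the range check shows why the comparison fails when the two values come from different address spaces. The function name mmap_fits(), the 8 KiB page size, and the 64 KiB resource length are assumptions made for illustration; the addresses reuse the root bus window quoted earlier in this log.

```c
#include <stdint.h>

#define PAGE_SHIFT 13 /* assumed 8 KiB pages, for illustration only */

/* Simplified model of the fit check: returns 1 when the page range
 * [pgoff, pgoff + nr_pages) lies entirely inside the resource, where
 * the resource is described by its CPU start address and length. */
static int mmap_fits(uint64_t res_start, uint64_t res_len,
                     uint64_t pgoff, uint64_t nr_pages)
{
    uint64_t pci_start, size;

    if (res_len == 0)
        return 0;
    pci_start = res_start >> PAGE_SHIFT;
    size = ((res_len - 1) >> PAGE_SHIFT) + 1; /* length in pages */

    return pgoff >= pci_start &&
           pgoff < pci_start + size &&
           pgoff + nr_pages <= pci_start + size;
}
```

With res_start = 0x2000100000000, a pgoff derived from the bus address (0x100000000 >> PAGE_SHIFT) falls below pci_start and the check rejects the mapping, whereas a pgoff derived from the resource address passes.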
test code using: ./test_mmap_proc /proc/bus/pci/0000:00/04.0 0x2000000
test code segment:
fd = open(argv[1], O_RDONLY);
...
sscanf(argv[2], "0x%lx", &offset);
left = offset & (PAGE_SIZE - 1);
offset &= PAGE_MASK;
ioctl(fd, PCIIOC_MMAP_IS_MEM);
addr = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, offset);
for (i = 0; i < 8; i++)
printf("%x ", addr[i + left]);
munmap(addr, PAGE_SIZE);
close(fd);
Fixes: 8c05cd08a7 ("PCI: fix offset check for sysfs mmapped files") Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Orabug: 22855133
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 7bd8ad7855e6ecdac36c6dc7946f5b6beff13ae2) Signed-off-by: Allen Pais <allen.pais@oracle.com>
PCI: Supply CPU physical address (not bus address) to iomem_is_exclusive()
iomem_is_exclusive() requires a CPU physical address, but on some arches we
supplied a PCI bus address instead.
On most arches, pci_resource_to_user(res) returns "res->start", which is a
CPU physical address. But on microblaze, mips, powerpc, and sparc, it
returns the PCI bus address corresponding to "res->start".
The result is that pci_mmap_resource() may fail when it shouldn't (if the
bus address happens to match an existing resource), or it may succeed when
it should fail (if the resource is exclusive but the bus address doesn't
match it).
Call iomem_is_exclusive() with "res->start", which is always a CPU physical
address, not the result of pci_resource_to_user().
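Both failure modes can be reproduced with a toy lookup table. This is only a sketch of the idea, not how iomem_is_exclusive() is implemented: struct region and is_exclusive() are hypothetical, and the addresses assume a bus-to-CPU offset of 0x2000000000000 as in the sparc root bus window quoted earlier in this log.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a claimed range of CPU physical addresses. */
struct region {
    uint64_t start;  /* first CPU physical address */
    uint64_t end;    /* last CPU physical address, inclusive */
    int exclusive;   /* nonzero if user mappings must be refused */
};

/* Return 1 if cpu_addr falls inside any exclusive region, else 0.
 * The table is indexed by CPU physical address, so passing a PCI bus
 * address gives a meaningless answer. */
static int is_exclusive(const struct region *tbl, size_t n, uint64_t cpu_addr)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (tbl[i].exclusive &&
            cpu_addr >= tbl[i].start && cpu_addr <= tbl[i].end)
            return 1;
    }
    return 0;
}
```

If the only exclusive region starts at CPU address 0x2000100000000, a query with the bus address 0x100000000 reports "not exclusive" and the mapping would wrongly succeed. If an unrelated exclusive region happens to sit at CPU address 0x100000000, the same query wrongly refuses a mapping that should be allowed.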
Fixes: e8de1481fd71 ("resource: allow MMIO exclusivity for device drivers") Suggested-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: Arjan van de Ven <arjan@linux.intel.com>
Orabug: 22855133
(cherry picked from commit ca620723d4ff9ea7ed484eab46264c3af871b9ae) Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit aee594b88bfcb58f51e0070de06d3879b3ea4609) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 2dc2632b0b252f8f50c276efa9cc4f00c8e067b5) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 934fffe0fdf1bf81c90be604827f19c4ede839fb) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 59860a19cf9b2acd24a7eaec6d539dbc984a90b9) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit b2fd930d3e95db05979aefc42291be1f6ba2c625) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 09dd990a13d7ea2a257f5be189f109fa80259599) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit de113e58a8cc67f7daa128c33ac49a33935dc767) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit c0691adf4a28993c77371969ff7800901ff1a67b) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit bb9160790d346bcbfc0b5ca701e7e6f6d5d56f87) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 2998a347e93c41d8677e73510a139aa54339c88f) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
(cherry picked from commit 4e49b00da550d255a9470d4ffcffb6dea2227570) Signed-off-by: Allen Pais <allen.pais@oracle.com>