Fixes a direct call to kfree_skb when nlmsg_free should be used.
Fixes: 2ca546b92a02 ('IB/sa: Route SA pathrecord query through netlink') Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 0f377d86252d11bfea941852785e3094b93601a7)
This patch routes a SA pathrecord query to netlink first and processes the
response appropriately. If a failure is returned, the request will be sent
through IB. The decision whether to route the request to netlink first is
determined by the presence of a listener for the local service netlink
multicast group. If the user-space local service netlink multicast group
listener is not present, the request will be sent through IB, just like
what is currently being done.
Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: John Fleck <john.fleck@intel.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 2ca546b92a024d07adedd15b4c262b1c2c0786ec)
This patch adds a function to check if listeners for a netlink multicast
group are present. It also adds a function to receive netlink response
messages.
Signed-off-by: Kaike Wan <kaike.wan@intel.com> Signed-off-by: John Fleck <john.fleck@intel.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit bc10ed7d3d19ff61427007b4d7bf98d3e57bb333)
Due to relaxed ordering requirements on multiple architectures, drivers
are required to use wmb/rmb/mb combinations when they need to guarantee
observability between the memory and the HW.
The mpt3sas driver already uses wmb() for this purpose. However, it
issues a writel() right after the wmb(), and writel() on arm/arm64
architectures already has an embedded wmb() call.
This results in unnecessary performance loss and code duplication.
writel() already guarantees ordering for both CPU and bus, so the
additional wmb() is not needed.
Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Acked-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b1391a5bf83a593bbe92d1f9bddaf563be5c7c9d) Signed-off-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Chaitra P B <chaitra.basappa@broadcom.com> Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 7cfa76963f1872461adff2e84edfbaa8e17d189b) Signed-off-by: Shan Hai <shan.hai@oracle.com>
A small glitch/degraded performance in Crusader with SAS drives is
improved by removing unnecessary spinlocks while clearing the SCSI
command in the driver's internal lookup table.
Signed-off-by: Chaitra P B <chaitra.basappa@broadcom.com> Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 459325c466d278d3c9f51ddc9bb544c014136fd1) Signed-off-by: Shan Hai <shan.hai@oracle.com>
Due to a loop in the IO path, the HBA can receive heavy IO, and because
the driver does not update the Reply Post Host Index frequently, there
is a high chance that the firmware cannot find a free entry in the
Reply Post Descriptor Queue (i.e. a queue overflow occurs), which shows
up as a 0x2100 firmware fault. To fix this, a threshold value is
defined. After continuously processing this threshold number of reply
descriptors, the driver updates the Reply Descriptor Host Index so that
the same number of reply descriptor entries are freed and become
available to the firmware, and the firmware fault is no longer
observed. The threshold is defined as 1/3 of the HBA queue depth.
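The threshold idea described above can be sketched in plain C. This is only an illustration, not the mpt3sas code: `struct reply_queue`, `publish_host_index()` and `process_replies()` are hypothetical names, and the register write is modelled as a counter.

```c
/* Hypothetical stand-in for the reply queue state; not the mpt3sas structures. */
struct reply_queue {
    unsigned int host_index;    /* last index published to the firmware */
    unsigned int index_writes;  /* how many times the index was published */
};

/* Stand-in for the register write that frees entries for the firmware. */
static void publish_host_index(struct reply_queue *q, unsigned int index)
{
    q->host_index = index;
    q->index_writes++;
}

/*
 * Process `pending` reply descriptors from a queue of `queue_depth` entries
 * (queue_depth must be non-zero). The host index is published after every
 * queue_depth / 3 descriptors, and once more at the end if any processed
 * descriptors are still unpublished.
 */
static unsigned int process_replies(struct reply_queue *q,
                                    unsigned int queue_depth,
                                    unsigned int pending)
{
    unsigned int threshold = queue_depth / 3;
    unsigned int since_update = 0;
    unsigned int index = q->host_index;
    unsigned int done;

    if (threshold == 0)
        threshold = 1;

    for (done = 0; done < pending; done++) {
        index = (index + 1) % queue_depth;
        if (++since_update == threshold) {
            publish_host_index(q, index);
            since_update = 0;
        }
    }
    if (since_update)
        publish_host_index(q, index);
    return done;
}
```

With a queue depth of 30, the threshold is 10, so the firmware gets entries back in batches of ten instead of only after the whole burst has been processed.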
Signed-off-by: Chaitra P B <chaitra.basappa@broadcom.com> Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 6b4c335a0f6cc61c69cd24f24e40b118bd9f778a) Signed-off-by: Shan Hai <shan.hai@oracle.com>
Driver processes the event MPI26_EVENT_ACTIVE_CABLE_DEGRADED when a
cable is present and is running at a degraded speed (below the SAS3 12
Gb/s rate). Prints added to inform the user that the cable is not
running at optimal speed.
Signed-off-by: Chaitra P B <chaitra.basappa@broadcom.com> Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 6c44c0fe91af7bac78dcaf4c106421862530f499) Signed-off-by: Shan Hai <shan.hai@oracle.com>
Niranjan Patil [Thu, 23 Mar 2017 15:57:24 +0000 (08:57 -0700)]
xen-blkback: report hotplug-status busy when detach is initiated but frontend device is busy.
In the case of a deferred detach, xm/xend is not notified about the busy
status and has to wait for the timeout (default 100s) before reporting
the detach failure to the user. This behavior is sometimes incorrectly
interpreted as a tool hang.
This patch updates hotplug-status with busy so that xm is notified
instead of timing out.
Joe Carnuccio [Wed, 15 Mar 2017 16:48:43 +0000 (09:48 -0700)]
qla2xxx: Allow vref count to timeout on vport delete.
This commit fixes a panic that could be triggered with the following steps:
1.create vhba
#virsh nodedev-create vhba.xml
2.destroy vhba
#virsh nodedev-destroy scsi_host9
This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 3b6571c180da85e43550c608e954ab7b2a31d954) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
flush_state of ALLOC_CHUNK and it successfully allocates a new
chunk, then instead of trying to reserve space again,
reserve_metadata_bytes returns 1 immediately.
Eventually the callers who call start_transaction() usually just
do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get
a panic when dereferencing a pointer which is ERR_PTR(1).
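Why ERR_PTR(1) slips through IS_ERR() can be shown with a small userspace re-creation of the kernel helpers. The lowercase names `err_ptr()` and `is_err()` are stand-ins written for this sketch; the range check mirrors the kernel's MAX_ERRNO convention, where only the values -4095..-1 are treated as encoded errors.

```c
#include <stdbool.h>

/* Userspace re-creation of the kernel's pointer-encoded error helpers. */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error)
{
    return (void *)error;
}

static inline bool is_err(const void *ptr)
{
    /* Only the top MAX_ERRNO addresses, i.e. -4095..-1, count as errors. */
    return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}
```

A positive return value such as 1 maps to address 0x1, which is outside the error range, so a caller that only checks is_err() goes on to dereference it.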
The following patch fixes the above problem.
"btrfs: flush_space: treat return value of do_chunk_alloc properly"
https://patchwork.kernel.org/patch/7778651/
This adds comments to clarify do_chunk_alloc()'s return value.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 28b737f6ede3661fe610937706c4a6f50e9ab769) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depends how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:
    int ret = -ENOSPC;
    ...
    ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
    if (!ret) {
        block_rsv_add_bytes(block_rsv, num_bytes, 0);
        return 0;
    }
    return ret;
So it will return -ENOSPC.
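One way to avoid leaking the positive "chunk allocated" value to callers is to fold it into 0 at the flush step. The sketch below only illustrates that idea; `fake_chunk_alloc()` and `flush_alloc_chunk()` are hypothetical names and this is not claimed to be the exact upstream change.

```c
#include <errno.h>

/* Hypothetical stand-in for do_chunk_alloc(): 1 = new chunk allocated,
 * 0 = nothing to do, negative errno = failure. */
static int fake_chunk_alloc(int outcome)
{
    return outcome;
}

/* Fold "allocated a chunk" (1) into success (0) so that callers testing
 * `if (!ret)` treat it as success; real errors pass through unchanged. */
static int flush_alloc_chunk(int outcome)
{
    int ret = fake_chunk_alloc(outcome);

    if (ret > 0)
        ret = 0;
    return ret;
}
```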
Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit eecba891d38051ebf7f4af6394d188a5fd151a6a) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
At present we perform an xfrm_lookup() for each UDPv6 message we
send. The lookup involves querying the flow cache (flow_cache_lookup)
and, in case of a cache miss, creating an XFRM bundle.
If we miss the flow cache, we can end up creating a new bundle and
deriving the path MTU (xfrm_init_pmtu) from an already transformed
dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down
to xfrm_lookup(). This can happen only if we're caching the dst_entry
in the socket, that is when we're using a connected UDP socket.
To put it another way, the path MTU shrinks each time we miss the flow
cache, which later on leads to incorrectly fragmented payload. It can
be observed with ESPv6 in transport mode:
1) Set up a transformation and lower the MTU to trigger fragmentation
# ip xfrm policy add dir out src ::1 dst ::1 \
tmpl src ::1 dst ::1 proto esp spi 1
# ip xfrm state add src ::1 dst ::1 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
# ip link set dev lo mtu 1500
2) Monitor the packet flow and set up a UDP sink
# tcpdump -ni lo -ttt &
# socat udp6-listen:12345,fork /dev/null &
4) Compare it to a non-connected socket
# perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345
00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448
00:00:00.000010 IP6 ::1 > ::1: frag (1448|64)
What happens in step (3) is:
1) when connecting the socket in __ip6_datagram_connect(), we
perform an XFRM lookup, miss the flow cache, create an XFRM
bundle, and cache the destination,
2) afterwards, when sending the datagram, we perform an XFRM lookup,
again, miss the flow cache (due to mismatch of flowi6_iif and
flowi6_oif, which is an issue of its own), and recreate an XFRM
bundle based on the cached (and already transformed) destination.
To prevent the recreation of an XFRM bundle, avoid an XFRM lookup
altogether whenever we already have a destination entry cached in the
socket. This prevents the path MTU shrinkage and brings us on par with
UDPv4.
The fix also benefits connected PINGv6 sockets, another user of
ip6_sk_dst_lookup_flow(), who also suffer messages being transformed
twice.
Joint work with Hannes Frederic Sowa.
Reported-by: Jan Tluka <jtluka@redhat.com> Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 00bc0ef5880dc7b82f9c320dead4afaad48e47be) Signed-off-by: Todd Vierling <todd.vierling@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
net/ipv6/ip6_output.c
Juergen Gross [Tue, 2 Aug 2016 07:22:12 +0000 (09:22 +0200)]
xen: Make VPMU init message look less scary
The default for the Xen hypervisor is to not enable VPMU in order to
avoid security issues. In this case the Linux kernel will issue the
message "Could not initialize VPMU for cpu 0, error -95" which looks
more like an error than a normal state.
Change the message to something less scary in case the hypervisor
returns EOPNOTSUPP or ENOSYS when trying to activate VPMU.
Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Orabug: 25873416
(cherry picked from commit 0252937a87e1d46a8261da85cbd99dffe612a2d3) Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@gmail.com>
Jakub Sitnicki [Wed, 26 Oct 2016 09:21:14 +0000 (11:21 +0200)]
ipv6: Don't use ufo handling on later transformed packets
Similar to commit c146066ab802 ("ipv4: Don't use ufo handling on later
transformed packets"), don't perform UFO on packets that will be IPsec
transformed. To detect it we rely on the fact that headerlen in
dst_entry is non-zero only for transformation bundles (xfrm_dst
objects).
Unwanted segmentation can be observed with a NETIF_F_UFO capable device,
such as a dummy device:
DEV=dum0 LEN=1493
ip li add $DEV type dummy
ip addr add fc00::1/64 dev $DEV nodad
ip link set $DEV up
ip xfrm policy add dir out src fc00::1 dst fc00::2 \
tmpl src fc00::1 dst fc00::2 proto esp spi 1
ip xfrm state add src fc00::1 dst fc00::2 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach") Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f89c56ce710afa65e1b2ead555b52c4807f34ff7)
When calculating po->tp_hdrlen + po->tp_reserve the result can overflow.
Fix by checking that tp_reserve <= INT_MAX on assign.
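A minimal sketch of such a check at assignment time, assuming a hypothetical setter `set_reserve()` (the real socket option handler has more context around it):

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical setter: reject values that could overflow later int
 * arithmetic, and leave the stored value untouched on failure. */
static int set_reserve(unsigned int *tp_reserve, unsigned int val)
{
    if (val > INT_MAX)
        return -EINVAL;
    *tp_reserve = val;
    return 0;
}
```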
Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit bcc5364bdcfe131e6379363f089e7b4108d35b70) Signed-off-by: Brian Maly <brian.maly@oracle.com>
When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow.
Add a check that tp_block_size * tp_block_nr <= UINT_MAX.
Since frames_per_block <= tp_block_size, the expression would
never overflow.
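The product check can be written without ever computing the product, by dividing instead. The helper name `blocks_fit()` is made up for this sketch:

```c
#include <limits.h>
#include <stdbool.h>

/* True when block_size * block_nr fits in an unsigned int.
 * Division is used so the check itself cannot overflow. */
static bool blocks_fit(unsigned int block_size, unsigned int block_nr)
{
    if (block_size == 0)
        return true; /* the product is 0 */
    return block_nr <= UINT_MAX / block_size;
}
```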
Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8f8d28e4d6d815a391285e121c3a53a0b6cb9e7b) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Subtracting tp_sizeof_priv from tp_block_size and casting to int
to check whether one is less than the other doesn't always work
(both of them are unsigned ints).
Compare them as is instead.
Also cast tp_sizeof_priv to u64 before using BLK_PLUS_PRIV, as
it can overflow inside BLK_PLUS_PRIV otherwise.
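The difference between the two styles of check can be seen in isolation. In this sketch both function names are hypothetical, and the BLK_PLUS_PRIV adjustment is left out to keep the example small:

```c
#include <stdbool.h>

/* Old-style check: subtract, cast to int, test the sign. */
static bool priv_fits_by_subtraction(unsigned int block_size, unsigned int priv)
{
    return (int)(block_size - priv) > 0;
}

/* Direct comparison of the two unsigned values. */
static bool priv_fits_by_comparison(unsigned int block_size, unsigned int priv)
{
    return priv < block_size;
}
```

With block_size = 4096 and priv = 0xFFFFF000, the unsigned subtraction wraps around to 0x2000, which is positive as an int, so the subtraction-based check accepts a private area far larger than the block while the direct comparison rejects it.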
Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2b6867c2ce76c596676bec7d2d525af525fdc6e2) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Eric Dumazet [Mon, 29 Jun 2015 15:10:30 +0000 (17:10 +0200)]
fs/file.c: __fget() and dup2() atomicity rules
__fget() does lockless fetch of pointer from the descriptor
table, attempts to grab a reference and treats "it was already
zero" as "it's already gone from the table, we just hadn't
seen the store, let's fail". Unfortunately, that breaks the
atomicity of dup2() - __fget() might see the old pointer,
notice that it's been already dropped and treat that as
"it's closed". What we should be getting is either the
old file or new one, depending whether we come before or after
dup2().
Dmitry had the following test failing sometimes:
    #include <errno.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int fd;

    /* Reader thread: races with dup2() in main(). */
    void *Thread(void *x) {
        char buf;
        int n = read(fd, &buf, 1);
        if (n != 1)
            exit(printf("read failed: n=%d errno=%d\n", n, errno));
        return 0;
    }

    int main()
    {
        fd = open("/dev/urandom", O_RDONLY);
        int fd2 = open("/dev/urandom", O_RDONLY);
        if (fd == -1 || fd2 == -1)
            exit(printf("open failed\n"));
        pthread_t th;
        pthread_create(&th, 0, Thread, 0);
        if (dup2(fd2, fd) == -1)
            exit(printf("dup2 failed\n"));
        pthread_join(th, 0);
        if (close(fd) == -1)
            exit(printf("close failed\n"));
        if (close(fd2) == -1)
            exit(printf("close failed\n"));
        printf("DONE\n");
        return 0;
    }
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 25408921
From 25408921: Signed-off-by: todd.vierling@oracle.com Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Zhu Yanjun [Thu, 18 May 2017 03:44:12 +0000 (23:44 -0400)]
IB/ipoib: add get_settings in ethtool
In order to let the bonding driver report the correct speed
of the underlying interfaces when they are IPoIB, the ethtool
function get_settings() is implemented in the IPoIB driver.
RDS/IB: active bonding port state fix for intfs added late
When new interfaces are added after boot, or late notifier
events cause an interface to be added late, the port state
needs to move to UP or DOWN (and not stay in INIT state)
regardless of the order in which data structure initialization
races with NETDEV notifier events.
Without that, subsequent failover/failback processing may
not happen properly, as it looks for port_state in the
UP or DOWN state.
xsvhba's internally generated scsi command timeout code prematurely completes
a command rather than relying on qlogic to complete with "CMD_TIMEOUT" code.
Actual command completes just after xsigo timeout completion and
causes the freed buffer to be overwritten with inquiry data.
These code changes will allow scsi mid layer to do the recovery.
The original xsigo timeout code is not there in ESX xsvhba source code and was
mistakenly brought over in uek.
659743b02c41 splits iscsi session lock into two locks, one to be used while
sending a request to the target and the other to be used while processing
a response. This patch has caused multiple bugs due to races while
accessing various lists that hold the iscsi_task in the iscsi_conn
structure.
Although upstream commit 6f8830f5bbab partially fixes the issue, there
is still at least one regression seen when the same iscsi task is accessed
simultaneously in iscsi_xmit_task() and iscsi_complete_task(), which causes
a null pointer dereference and panic.
It's best to revert this patch until we find a permanent solution.
Signed-off-by: Ashish Samant <ashish.samant@oracle.com> Reviewed-by: John Sobecki <john.sobecki@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
The NFSv2/v3 code does not systematically check whether we decode past
the end of the buffer. This generally appears to be harmless, but there
are a few places where we do arithmetic on the pointers involved and
don't account for the possibility that a length could be negative. Add
checks to catch these.
Reported-by: Tuomas Haanpää <thaan@synopsys.com> Reported-by: Ari Kauppi <ari@synopsys.com> Reviewed-by: NeilBrown <neilb@suse.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit 13bf9fbff0e5e099e2b6f003a0ab8ae145436309) Signed-off-by: Brian Maly <brian.maly@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com>
Conflicts:
fs/nfsd/nfsxdr.c
Dave Kleikamp [Mon, 15 May 2017 19:14:13 +0000 (14:14 -0500)]
sched/rt: Minimize rq->lock contention in do_sched_rt_period_timer()
With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially
takes each CPU's rq->lock. On a large, busy system, the cumulative time it
takes to acquire each lock can be excessive, even triggering a watchdog
timeout.
If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does
nothing while holding the lock, so don't bother taking it at all.
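The shape of this optimization can be sketched as follows. Everything here is hypothetical: `struct fake_rt_rq` is not the kernel structure, the lock is modelled as a counter, and the sketch ignores the memory-ordering questions that an unlocked read raises in real kernel code.

```c
/* Hypothetical per-CPU runqueue state; the lock is modelled as a counter. */
struct fake_rt_rq {
    unsigned long long rt_time;
    unsigned int rt_nr_running;
    unsigned int lock_acquisitions;
};

/* Visit every runqueue, taking its "lock" only when there is work to do.
 * Returns how many runqueues were skipped without locking. */
static unsigned int period_timer_pass(struct fake_rt_rq *rqs, unsigned int n)
{
    unsigned int i, skipped = 0;

    for (i = 0; i < n; i++) {
        if (!rqs[i].rt_time && !rqs[i].rt_nr_running) {
            skipped++;
            continue;
        }
        rqs[i].lock_acquisitions++; /* stands in for taking rq->lock */
        /* ... throttling and accounting work would happen here ... */
    }
    return skipped;
}
```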
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a767637b-df85-912f-ba69-c90ee00a3fb6@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 25491970
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
chris hyser [Thu, 18 May 2017 18:18:33 +0000 (12:18 -0600)]
sparc64: cache_line_size() returns larger value for cache line size.
SPARC currently returns L1 data cache line size (as low as 32 bytes on
some systems) though L2 and L3 cache line sizes may be higher. As
cache_line_size() is used by code to align memory requests to prevent
unnecessary cache line sharing, this patch returns the max of L2 and L3
sizes, currently 64 bytes.
Signed-off-by: Chris Hyser <chris.hyser@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Menno Lageman [Tue, 2 May 2017 09:53:53 +0000 (05:53 -0400)]
sparc64: set the ISCNTRLD bit for SP service handles
Service handles generated by the ds driver can collide with service handles
generated by the SP, causing failures with Domain Services on the SP such
as 'ldom_req_sp_token: set-token failed: no reply' errors.
Ensure that service handles generated by the ds driver do not collide
with service handles generated by the SP by setting the ISCNTRLD bit in
the lower half of the service handle for SP Domain Services. This is
similar to what Solaris does.
Rob Gardner [Fri, 19 May 2017 01:14:06 +0000 (19:14 -0600)]
sparc64: DAX recursive lock removed
At some point in the past, the call to get_user_pages() was changed to
get_user_pages_fast(). The former requires that mmap_sem be held when
making the call, which the driver respected. But the latter requires that
mmap_sem not be held, since it acquires it later. So mmap_sem was being
acquired by the driver, then again in get_user_pages_fast(). In between
these two acquisitions, another thread can come along and call mmap(),
which will wait on the same semaphore, and deadlock with the subsequent
get_user_pages_fast() attempt to get it again.
Liam R. Howlett [Wed, 17 May 2017 15:47:00 +0000 (11:47 -0400)]
sparc/ftrace: Fix ftrace graph time measurement
The ftrace function_graph time measurements of a given function are not
accurate compared with those recorded by ftrace using the function
filters. This change pulls the x86_64 fix from commit 722b3c746953
("ftrace/graph: Trace function entry before updating index") into the
sparc-specific prepare_ftrace_return, which stops ftrace from
counting interrupted tasks in the time measurement.
Example measurements for select_task_rq_fair running "hackbench 100
process 1000":
             | tracing/trace_stat/function0 | function_graph
Before patch | 2.802 us                     | 4.255 us
After patch  | 2.749 us                     | 3.094 us
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(Cherry picked from commit 48078d2dac0a26f84f5f3ec704f24f7c832cce14)
Note: Upstream fix needed an extra parameter of NULL for
prepare_ftrace_return.
arch, mm: convert all architectures to use 5level-fixup.h
If an architecture uses 4level-fixup.h we don't need to do anything as
it includes 5level-fixup.h.
If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before inclusion of the header. This makes the asm-generic code use
5level-fixup.h.
If an architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 9849a5697d3defb2087cb6b9be5573a142697889)
Kirill A. Shutemov [Thu, 9 Mar 2017 14:24:04 +0000 (17:24 +0300)]
asm-generic: introduce __ARCH_USE_5LEVEL_HACK
We are going to introduce <asm-generic/pgtable-nop4d.h> to provide an
abstraction for a properly folded p4d level (as opposed to the
5level-fixup.h hack). The new header will be included from pgtable-nopud.h.
If an architecture uses <asm-generic/nop*d.h>, we cannot use
5level-fixup.h directly to quickly convert the architecture to 5-level
paging as it would conflict with pgtable-nop4d.h.
With this patch an architecture can define __ARCH_USE_5LEVEL_HACK before
including <asm-generic/nop*d.h> to use 5level-fixup.h.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 30ec842660bd0d056d4a7028ac5bd4a82b113d4f)
Kirill A. Shutemov [Thu, 9 Mar 2017 14:24:03 +0000 (17:24 +0300)]
asm-generic: introduce 5level-fixup.h
We are going to switch core MM to the 5-level paging abstraction.
This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header allows quickly making all
architectures compatible with 5-level paging in core MM.
In the long run we would like to switch architectures to a properly folded
p4d level by using <asm-generic/pgtable-nop4d.h>, but it requires more
changes to arch-specific code.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 505a60e225606fbd3d2eadc31ff793d939ba66f1)
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Tushar Dave <tushar.n.dave@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Eric Snowberg [Wed, 10 May 2017 14:50:11 +0000 (07:50 -0700)]
sparc64: /sys/firmware/efi missing during EFI boot
The newest version of OBP is capable of doing an EFI boot. When Linux
is booted through this EFI loader, the /sys/firmware/efi directory does
not exist. Many userspace applications, such as GRUB, check whether
the /sys/firmware/efi directory exists; if it does, the kernel
has booted in EFI mode.
A new Open Firmware property called efi-booter has been added
to /chosen. This new property is only present when doing an
EFI boot.
Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Reviewed-by: Thomas Tai <thomas.tai@oracle.com>
Orabug: 26037358 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Fri, 2 Dec 2016 08:01:47 +0000 (13:31 +0530)]
Allow default value of npools used for iommu to be configured from cmdline
The default value of the number of pools used by the pooled IOMMU
allocator in lib/iommu-common.c is a constant today (set at 16).
It is possible that, for some platforms and some devices, the combination
of latency and frequency of iommu alloc/free requests may be such
as to trigger fragmentation within a pool, leading to iommu alloc failure.
Reducing the number of pools (and thus increasing the pool size) can
minimize the risk of those failures.
This patch provides a command line hook to set the default number of
pools at boot time.
Ported to UEK4
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
George Kennedy [Mon, 15 May 2017 14:43:56 +0000 (07:43 -0700)]
SPARC64: Add Linux vds driver Device ID support for Solaris guest boot
Currently, Solaris guest backend disk images cannot be moved from the Device ID
they were created at and still boot. This bug fix adds Solaris Device ID
support to the Linux vds driver to allow a Solaris guest backend disk image to
be moved to a different device ID from where it was created and still boot.
The Linux vds driver support added in this bug is for Solaris disk images
only. In the future, Solaris Device ID support for physical disk backends will
be added to the Linux vds driver as well.
From PSARC/1995/352:
Solaris Device IDs provide a means for identifying a device, independent of the
device's current name or device number. The instance number of a device number
may change across reconfiguration boots, changing the device number (dev_t) for
that device. Operator errors in recabling can cause devices to swap logical
device names, introducing the potential for data loss.
Signed-off-by: George Kennedy <george.kennedy@oracle.com> Reviewed-by: Alexandre Chartre <Alexandre.Chartre@oracle.com>
Orabug: 25836231 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Some huge page virtual addresses do not work with get_user_pages. Since
the purpose of calling get_user_pages is for its locking side effect, it
is not at all necessary for huge pages since they are permanently
pinned. So the failure is avoided and the unnecessary locking/unlocking
is eliminated.
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com> Acked-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Thomas Tai [Mon, 8 May 2017 20:37:40 +0000 (13:37 -0700)]
ldmvsw: unregistering netdev before disable hardware
When running the LDom binding/unbinding test, the kernel may panic
in ldmvsw_open(). The most likely cause is that, because we remove
the ldc connection before unregistering the netdev in vsw_port_remove(),
we open a window of time in which one process can be removing the
device while another is trying to UP the device. This also sometimes causes
a vio handshake error due to opening a device without closing it completely.
We should unregister the netdev before we disable the "hardware".
Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Jane Chu [Wed, 15 Mar 2017 21:58:46 +0000 (14:58 -0700)]
arch/sparc: Measure receiver forward progress to avoid send mondo timeout
A large sun4v SPARC system may have moments of intensive xcall activities,
usually caused by unmapping many pages on many CPUs concurrently. This can
flood receivers with CPU mondo interrupts for an extended period, causing
some unlucky senders to hit send-mondo timeout. This problem gets worse
as cpu count increases because sometimes mappings must be invalidated on
all CPUs, and sometimes all CPUs may gang up on a single CPU.
But a busy system is not a broken system. In the above scenario, as long
as the receiver is making forward progress processing mondo interrupts,
the sender should continue to retry.
This patch implements the receiver's forward progress meter by introducing
a per cpu counter 'cpu_mondo_counter[cpu]' where 'cpu' is in the range
of 0..NR_CPUS. The receiver increments its counter as soon as it receives
a mondo and the sender tracks the receiver's counter. Every 10000 retries,
if the receiver has stopped making forward progress, the sender declares
send-mondo-timeout and panics; otherwise, the receiver is allowed to keep
making forward progress.
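The forward-progress meter can be illustrated with a small userspace simulation. None of the names below come from the patch: `struct fake_receiver`, `try_send()` and `send_mondo()` are hypothetical, and the receiver is simulated rather than being a real CPU handling interrupts.

```c
#include <stdbool.h>

#define RETRIES_PER_CHECK 10000

/* Hypothetical simulated receiver: it accepts the mondo once more than
 * `busy_retries` attempts have been made, and bumps its counter every
 * `progress_every` attempts while busy (0 = never makes progress). */
struct fake_receiver {
    unsigned long counter;
    unsigned long busy_retries;
    unsigned long progress_every;
    unsigned long attempts;
};

static bool try_send(struct fake_receiver *r)
{
    r->attempts++;
    if (r->progress_every && r->attempts % r->progress_every == 0)
        r->counter++;
    return r->attempts > r->busy_retries;
}

/* Returns true if delivered, false if the sender would declare a timeout. */
static bool send_mondo(struct fake_receiver *r)
{
    unsigned long last_seen = r->counter;
    unsigned long retries = 0;

    while (!try_send(r)) {
        if (++retries % RETRIES_PER_CHECK)
            continue;
        if (r->counter == last_seen)
            return false;       /* no forward progress: timeout */
        last_seen = r->counter; /* progress seen: keep retrying */
    }
    return true;
}
```

A receiver that stays busy for 25000 attempts but keeps incrementing its counter is eventually served, while one whose counter never moves is declared stuck at the first 10000-retry checkpoint.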
Orabug: 25476541 Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-By: Steve Sistare <steven.sistare@oracle.com> Reviewed-By: Anthony Yznaga <anthony.yznaga@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
DAX submit needs to be updated to the latest HV spec. Along with a couple
small updates, the biggest modification is changing nomap_va to
status_data. This is mostly a cosmetic change but also adds support to
return the unavailable code via the exec ioctl. Further, augment the
comments and fix up a couple nits in the ccb submit hcall in hypervisor.h.
Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Atish Patra <atish.patra@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
jane Chu [Wed, 22 Mar 2017 22:49:05 +0000 (16:49 -0600)]
arch/sparc: support NR_CPUS = 4096
Linux SPARC64 limits NR_CPUS to 4064 because init_cpu_send_mondo_info()
only allocates a single page for NR_CPUS mondo entries. Thus we cannot
use all 4096 CPUs on some SPARC platforms.
To fix, allocate (2^order) pages where order is set according to the size
of cpu_list for possible cpus. Since cpu_list_pa and cpu_mondo_block_pa
are not used in asm code, there are no imm13 offsets from the base PA
that will break because they can only reach one page.
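Deriving the allocation order from a byte size can be sketched like this. The 8 KB `SKETCH_PAGE_SIZE` and the helper name `order_for_size()` are assumptions made for illustration; kernel code would rely on its own page-size constants and helpers.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 8192UL /* assumption for illustration only */

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size.
 * Sizes close to SIZE_MAX are not handled by this sketch. */
static unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```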
Signed-off-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Atish Patra <atish.patra@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-By: Jane Chu <jane.chu@oracle.com> Reviewed-By: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Thu, 27 Apr 2017 09:20:18 +0000 (03:20 -0600)]
sparc64: fix fault handling in NGbzero.S and GENbzero.S
When any of the functions contained in NGbzero.S and GENbzero.S
are being run, we may end up taking a fault when executing one
of the store alternate address space instructions. If this
happens, the exception handler does not restore the %asi
register.
This commit fixes the issue by introducing a new exception
handler that ensures the %asi register is restored when
a fault is handled.
Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com> Reviewed-by: Babu Moger <babu.moger@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Modify sys_dax.h such that new libdax can be compiled by including this
file unmodified. Userspace does not have u16, u32, etc. types defined and
as stated in Section 5e of Documentation/CodingStyle, we should be using
__u16, __u32, etc. in the ioctl structures which are exported to userspace.
Further, rename the DAXIOC_DEP_[number] ioctls and use DAXIOC_[name]_OLD
instead.
Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Scott Wood [Sat, 29 Apr 2017 00:17:41 +0000 (19:17 -0500)]
bnx2x: Align RX buffers
The bnx2x driver is not providing proper alignment on the receive buffers it
passes to build_skb(), causing skb_shared_info to be misaligned.
skb_shared_info contains an atomic, and while PPC normally supports
unaligned accesses, it does not support unaligned atomics.
Aligning the size of rx buffers will ensure that page_frag_alloc() returns
aligned addresses.
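A minimal sketch of the size rounding, assuming a power-of-two alignment. sketch_align_up() is a hypothetical helper written for illustration; it is not the macro the driver uses, and the alignment value is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round size up to the next multiple of align.
 * align must be a non-zero power of two (checked by the assert).
 */
static size_t sketch_align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

If every buffer size is a multiple of the alignment, buffers carved back to back from an aligned page start at aligned offsets, which is the property the fix relies on.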
This can be reproduced on PPC by setting the network MTU to 1450 (or other
non-multiple-of-4) and then generating sufficient inbound network traffic
(one or two large "wget"s usually does it), producing the following oops:
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
David Miller [Sun, 19 Jun 2016 06:52:25 +0000 (23:52 -0700)]
PCI: Fix unaligned accesses in VC code
The save/restore buffer for VC state is composed first of a 2-byte control
register, then a bunch of 4-byte words.
This causes unaligned accesses, which trap on platforms such as sparc.
This is easy to fix by simply moving the buffer pointer forward by 4 bytes
instead of 2 after dealing with the control register. The length
adjustment needs to be changed likewise.
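The layout change can be sketched in userspace C as follows. All names here (sketch_vc_save, SKETCH_CTRL_SLOT) are hypothetical and the code is not taken from the PCI core; it only shows why reserving 4 bytes for the 2-byte control register keeps the following 32-bit words on 4-byte offsets.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes reserved for the 2-byte control register (4, not 2). */
#define SKETCH_CTRL_SLOT 4

/*
 * Store a 16-bit control value followed by nwords 32-bit words.
 * Returns the number of bytes used, or 0 if buf is too small.
 */
static size_t sketch_vc_save(uint8_t *buf, size_t len, uint16_t ctrl,
                             const uint32_t *words, size_t nwords)
{
    size_t need = SKETCH_CTRL_SLOT + nwords * sizeof(uint32_t);

    if (len < need)
        return 0;
    memset(buf, 0, SKETCH_CTRL_SLOT);
    memcpy(buf, &ctrl, sizeof(ctrl));
    buf += SKETCH_CTRL_SLOT;    /* advance by 4 so the words stay aligned */
    memcpy(buf, words, nwords * sizeof(uint32_t));
    return need;
}

/* Byte offset of the i-th 32-bit word inside the buffer. */
static size_t sketch_vc_word_offset(size_t i)
{
    return SKETCH_CTRL_SLOT + i * sizeof(uint32_t);
}
```

The length accounting changes with the pointer: the control register now consumes SKETCH_CTRL_SLOT bytes of the buffer rather than 2.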
Orabug: 25806778
Cherry-picked from b77b3610 PCI: Fix unaligned accesses in VC code
Fixes: 5f8fc43217a0 ("PCI: Include pci/pcie/Kconfig directly from pci/Kconfig") Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Anatoly Pugachev <matorola@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> CC: stable@vger.kernel.org # v4.6+ Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
kernel text, data, and bss fit in the locked TLB entries allotted for
the kernel, but this option is not set for every config that enables
lockdep.
A 4.10 kernel fails to boot with the console output
Kernel: Using 8 locked TLB entries for main kernel image.
hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
Program terminated
To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
usage every time lockdep is turned on.
Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
CONFIG_LOCKDEP is set to 'y'. When other lockdep-related config options
that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
enabled.
Fixes: 64740b06b7e5 ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc") Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Babu Moger <babu.moger@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Allen Pais <allen.pais@oracle.com>
config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc
This new config parameter limits the space used for "Lock debugging:
prove locking correctness" by about 4MB. Current sparc systems limit
the kernel image, including the .text, .data and .bss sections, to 32MB.
With the PROVE_LOCKING feature, the kernel could grow beyond this limit
and cause system boot-up issues. With this option, the kernel limits the
size of the entries of lock_chains, stack_trace etc. so that the kernel
fits in the required size limit. This is not visible to the user and is
only used for sparc.
Thomas Tai [Thu, 27 Apr 2017 17:51:48 +0000 (10:51 -0700)]
sparc64: fix cdev_put() use-after-free when unbinding an LDom
After turning on slub_debug=P kernel option, a kernel panic happens when
unbinding an LDom. This suggests that there is memory corruption.
The memory corruption is caused by vlds_fops_release() freeing a memory
structure containing a cdev. The cdev is needed by fs/file_table.c
after the file is released.
The common approach to solve this issue is to add a kobject member
in the structure and set it to be the parent of cdev. The kobject is
then responsible for freeing the structure when the reference count is
zero. The reference solution is based on the following patch.
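The lifetime rule described above can be illustrated with a small userspace model. This is a sketch under stated assumptions: struct sketch_ref is a hypothetical stand-in for a kobject, the counter is not atomic, and none of these names come from the vlds driver.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kobject: a count plus a release hook. */
struct sketch_ref {
    int count;                              /* not atomic in this sketch */
    void (*release)(struct sketch_ref *ref);
};

/* Hypothetical device structure whose lifetime is owned by the ref. */
struct sketch_dev {
    struct sketch_ref ref;                  /* must stay the first member */
    int *freed_flag;                        /* set to 1 on release */
};

static void sketch_ref_get(struct sketch_ref *ref)
{
    ref->count++;
}

static void sketch_ref_put(struct sketch_ref *ref)
{
    if (--ref->count == 0)
        ref->release(ref);
}

static void sketch_dev_release(struct sketch_ref *ref)
{
    /* ref is the first member, so the cast recovers the container. */
    struct sketch_dev *dev = (struct sketch_dev *)ref;

    *dev->freed_flag = 1;
    free(dev);
}

static struct sketch_dev *sketch_dev_create(int *freed_flag)
{
    struct sketch_dev *dev = malloc(sizeof(*dev));

    if (!dev)
        return NULL;
    dev->ref.count = 1;
    dev->ref.release = sketch_dev_release;
    dev->freed_flag = freed_flag;
    *freed_flag = 0;
    return dev;
}
```

In this model the file release path only drops its own reference; the structure is freed when the last holder, here standing in for the cdev, drops the final one.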
Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-By: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Tom Saeger <tom.saeger@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
The CCB_EXEC ioctl in the DAX driver returns ENOBUFS when the user must
free completion areas before the submission can succeed. There is a
dax_err() print when this condition occurs. This print should be changed to
a dax_dbg() print since this return value can be used by the caller to
trigger freeing the completion areas, hence an error print is too verbose.
Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Vitaly Kuznetsov [Sun, 1 May 2016 02:21:33 +0000 (19:21 -0700)]
Drivers: hv: kvp: fix IP Failover
Hyper-V VMs can be replicated to other hosts, and there is a feature to
set a different IP for replicas called 'Failover TCP/IP'. When such a
guest starts, the Hyper-V host sends it a KVP_OP_SET_IP_INFO message as soon
as we finish the negotiation procedure. The problem is that this can happen
(and it actually happens) before the userspace daemon connects, and we reply
with HV_E_FAIL to the message. As there are no repetitions we fail to set the
requested IP.
Solve the issue by postponing our reply to the negotiation message till
the userspace daemon is connected. We can't wait too long as there is a
host-side timeout (ca. 75 seconds) and if we fail to reply in this time
frame the whole KVP service will become inactive. The solution is not
ideal: if it takes the userspace daemon more than 60 seconds to connect,
IP Failover will still fail, but I don't see a solution with our current
separation between kernel and userspace parts.
The other two modules (VSS and FCOPY) don't require such a delay; leave them
untouched.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit 4dbfc2e68004c60edab7e8fd26784383dd3ee9bc) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
K. Y. Srinivasan [Fri, 26 Feb 2016 23:13:19 +0000 (15:13 -0800)]
Drivers: hv: util: Pass the channel information during the init call
Pass the channel information to the util drivers that need to defer
reading the channel while they are processing a request. This would address
the following issue reported by Vitaly:
Commit 3cace4a61610 ("Drivers: hv: utils: run polling callback always in
interrupt context") removed direct *_transaction.state = HVUTIL_READY
assignments from *_handle_handshake() functions introducing the following
race: if a userspace daemon connects before we get first non-negotiation
request from the server hv_poll_channel() won't set transaction state to
HVUTIL_READY as (!channel) condition will fail, we set it to non-NULL on
the first real request from the server.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit b9830d120cbe155863399f25eaef6aa8353e767f) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Olaf Hering [Tue, 15 Dec 2015 00:01:33 +0000 (16:01 -0800)]
Drivers: hv: utils: run polling callback always in interrupt context
All channel interrupts are bound to specific VCPUs in the guest
at the point the channel is created. While we currently invoke the
polling function on the correct CPU (the CPU to which the channel
is bound), in some cases we may run the polling function in
a non-interrupt context. This can potentially cause an issue, as the
polling function can be interrupted by the channel callback function.
Fix the issue by running the polling function on the appropriate CPU
at interrupt level. Additional details of the issue being addressed by
this patch are given below:
Currently hv_fcopy_onchannelcallback is called from interrupts and also
via the ->write function of hv_utils. Since the global variables used to
maintain state are not thread safe, the state can get out of sync.
This affects the variable state as well as the channel inbound buffer.
As suggested by KY adjust hv_poll_channel to always run the given
callback on the cpu which the channel is bound to. This avoids the need
for locking because all the util services are single threaded and only
one transaction is active at any given point in time.
Additionally, remove the context variable, as it will always be the same as
recv_channel.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit 3cace4a616108539e2730f8dc21a636474395e0f) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
K. Y. Srinivasan [Tue, 15 Dec 2015 00:01:32 +0000 (16:01 -0800)]
Drivers: hv: util: Increase the timeout for util services
Util services such as KVP and FCOPY need assistance from daemons running
in user space. Increase the timeout so we don't prematurely terminate
the transaction in the kernel. Host sets up a 60 second timeout for
all util driver transactions. The host will retry the transaction if it
times out. Set the guest timeout at 30 seconds.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit c0b200cfb0403740171c7527b3ac71d03f82947a) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Vitaly Kuznetsov [Sat, 1 Aug 2015 23:08:11 +0000 (16:08 -0700)]
Drivers: hv: kvp: check kzalloc return value
kzalloc() return value check was accidentally lost in 11bc3a5fa91f:
"Drivers: hv: kvp: convert to hv_utils_transport" commit.
We don't need to reset kvp_transaction.state here as we have the
kvp_timeout_func() timeout function and in case we're in OOM situation
it is preferable to wait.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit b36fda339729a974a8838978dcdc581d8ce68fd9) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Introduce VSS_OP_REGISTER1 to support kernel replying to the negotiation
message with its own version.
Add small change to vss_handle_handshake for RH compatibility
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25970637
(cherry picked from commit cd8dc0548511efff7a97d978f989ce67a883f9a5) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Venkat Venkatsubra [Mon, 8 May 2017 11:23:13 +0000 (04:23 -0700)]
RDS/IB: 4KB receive buffers get posted by mistake on 16KB frag connections.
When a connection uses 4KB fragments and then moves to 16KB frags
(for example during a uek2 to uek4 upgrade) we see 4KB buffers getting
posted on 16KB connections. This is happening because the 4KB buffers
(buffers from previous connection before the move to 16KB) are getting
added back to the current connection's (16KB) cache.
We will fix this by doing the following.
1) When the recv buffers get freed/released after either the application
is done reading it or the socket gets closed (process dies, etc.)
and RDS/IB decides to add that buffer back into the current cache,
make sure the frag size matches with that of the current connection.
2) When recv completion reports IB_WC_LOC_LEN_ERR, mark the connection state
as "buffers need to be rebuilt during reconnection". And at the time of
reconnect rebuild the cache even though the "frag size of the connection"
has not changed.
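The two steps above can be sketched as a small userspace model. Every name here (sketch_conn, sketch_cache_put, and so on) is hypothetical and the cache is a fixed array for brevity; the real RDS/IB cache and its freeing logic are not reproduced.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_CACHE_SLOTS 8

/* Hypothetical receive fragment: only its size matters here. */
struct sketch_frag {
    size_t frag_size;
};

/* Hypothetical connection with a small per-connection frag cache. */
struct sketch_conn {
    size_t frag_size;           /* frag size of the current connection */
    bool rebuild_cache;         /* set when a length error was reported */
    struct sketch_frag *cache[SKETCH_CACHE_SLOTS];
    size_t cached;
};

/*
 * Step 1: only recycle a buffer whose size matches the connection.
 * Returns false when the caller must free the buffer instead.
 */
static bool sketch_cache_put(struct sketch_conn *conn, struct sketch_frag *frag)
{
    if (frag->frag_size != conn->frag_size)
        return false;
    if (conn->cached == SKETCH_CACHE_SLOTS)
        return false;
    conn->cache[conn->cached++] = frag;
    return true;
}

/* Step 2a: a local length error marks the cache as needing a rebuild. */
static void sketch_on_len_error(struct sketch_conn *conn)
{
    conn->rebuild_cache = true;
}

/* Step 2b: on reconnect, rebuild even if the frag size is unchanged. */
static void sketch_reconnect(struct sketch_conn *conn, size_t new_frag_size)
{
    if (conn->rebuild_cache || new_frag_size != conn->frag_size) {
        conn->cached = 0;       /* real code would free the buffers here */
        conn->rebuild_cache = false;
    }
    conn->frag_size = new_frag_size;
}
```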
Ajaykumar Hotchandani [Fri, 5 May 2017 19:08:32 +0000 (12:08 -0700)]
mlx4: limit max MSIX allocations
We get more than 64 MSI-X vectors from CX3 firmware 2.35.5530 onwards.
This results in legacy mode EQ allocs after 64 EQs, which ends up
flooding 3 vectors and causing performance degradation.
With this patch, we limit max vector allocations to MAX_MSIX (64).
When Mellanox driver can support more EQs without getting into legacy
mode, this patch should go away.
Peter Zijlstra [Sun, 13 Dec 2015 21:11:16 +0000 (22:11 +0100)]
sched/wait: Fix the signal handling fix
Jan Stancek reported that I wrecked things for him by fixing things for
Vladimir :/
His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
should not be possible, however my previous patch made this possible by
unconditionally checking signal_pending().
We cannot use current->state as was done previously, because it can be
changed by the instruction right after the store to that variable. We must
instead pass the initial state along and use that.
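The idea of passing the initial state along can be modelled in a few lines of userspace C. This is only a sketch: the SKETCH_* constants and sketch_wait_action() are hypothetical, and the pending-signal check is reduced to a boolean argument.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical wait modes for this sketch only. */
#define SKETCH_INTERRUPTIBLE   1
#define SKETCH_UNINTERRUPTIBLE 2

/*
 * Decide the wait result from the mode the wait was started with,
 * not from a task state that may have changed in the meantime.
 * Returns -EINTR only for an interruptible wait with a signal pending.
 */
static int sketch_wait_action(int initial_mode, bool signal_pending)
{
    if (initial_mode == SKETCH_INTERRUPTIBLE && signal_pending)
        return -EINTR;
    return 0;
}
```

Because the mode is an argument fixed at the start of the wait, an uninterruptible waiter can never observe -EINTR, which is the behaviour Jan's report expected.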
Fixes: 68985633bccb ("sched/wait: Fix signal handling in bit wait helpers") Reported-by: Jan Stancek <jstancek@redhat.com> Reported-by: Chris Mason <clm@fb.com> Tested-by: Jan Stancek <jstancek@redhat.com> Tested-by: Vladimir Murzin <vladimir.murzin@arm.com> Tested-by: Chris Mason <clm@fb.com> Reviewed-by: Paul Turner <pjt@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: tglx@linutronix.de Cc: Oleg Nesterov <oleg@redhat.com> Cc: hpa@zytor.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Orabug: 25908266
(cherry picked from commit dfd01f026058a59a513f8a365b439a0681b803af) Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Eric Dumazet [Wed, 30 Dec 2015 13:51:12 +0000 (08:51 -0500)]
udp: properly support MSG_PEEK with truncated buffers
Backport of this upstream commit into stable kernels : 89c22d8c3b27 ("net: Fix skb csum races when peeking")
exposed a bug in udp stack vs MSG_PEEK support, when user provides
a buffer smaller than skb payload.
In this case,
skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
msg->msg_iov);
returns -EFAULT.
This bug does not happen in upstream kernels since Al Viro did a great
job of replacing this with:
skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);
This variant is safe vs short buffers.
For the time being, instead of reverting Herbert Xu's patch and adding back
skb->ip_summed invalid changes, simply store the result of
udp_lib_checksum_complete() so that we avoid computing the checksum a
second time, and avoid the problematic
skb_copy_and_csum_datagram_iovec() call.
This patch can be applied on recent kernels as it avoids a double
checksumming, then backported to stable kernels as a bug fix.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 197c949e7798fbf28cfadc69d9ca0c2abbf93191)
Santosh Shilimkar [Wed, 7 Dec 2016 23:06:59 +0000 (15:06 -0800)]
net/mlx4_core: panic the system on unrecoverable errors
Mellanox catastrophic error recovery after device reset doesn't work and
in fact leads to an unusable node for the IB network since the HCA's ports
go down. At times a hard reset is needed to get the system rebooted,
which is a real problem in a production environment. Once the
network outage is detected, the unreachable node gets evicted and rebooted
on engineered systems using reboot, so a hung reboot command is
problematic. The idea is to let the kernel panic, which can recover the
system on its own with the necessary logs captured. There was a debate
on whether to use panic or machine restart, but it was agreed to use
panic instead of a silent reboot since that's the preferred option.
There is a Mellanox case open to investigate this issue. As such this
is a rare scenario, and even if the issue is fixed, it is expected
to avoid leading to the catas error case. This panic is limited to
the error case only.
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Mukesh Kacker <mukesh.kacker@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Orabug: 25873690
This is a change taken from QU2, it is not upstream.
(cherry picked from commit 271d694b34bd22e5632eaad41ea1d9a47f1bde3a) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
In forwarding table lookup function, xve_fwt_lookup(),
in case of error condition xve_fwt lock is not released.
This commit fixes this bug by releasing xve_fwt lock on error.
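The shape of such a fix is the usual single-exit unlock pattern, sketched here in userspace C. The lock is a simple counter and the table is a tiny array; all names are hypothetical and nothing below is taken from the xve driver.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lock: a counter so the sketch can observe imbalance. */
struct sketch_lock {
    int held;
};

static void sketch_lock_acquire(struct sketch_lock *l) { l->held++; }
static void sketch_lock_release(struct sketch_lock *l) { l->held--; }

/* Hypothetical forwarding table with a handful of entries. */
struct sketch_table {
    struct sketch_lock lock;
    const char *keys[4];
    int values[4];
    size_t count;
};

/*
 * Returns 0 and stores the value on success, -1 on error.
 * Every path, including the error paths, leaves through "unlock".
 */
static int sketch_lookup(struct sketch_table *t, const char *key, int *out)
{
    int ret = -1;
    size_t i;

    sketch_lock_acquire(&t->lock);
    if (key == NULL || out == NULL)
        goto unlock;            /* error path still releases the lock */
    for (i = 0; i < t->count; i++) {
        if (strcmp(t->keys[i], key) == 0) {
            *out = t->values[i];
            ret = 0;
            break;
        }
    }
unlock:
    sketch_lock_release(&t->lock);
    return ret;
}
```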
Reviewed-by: Chien Yen <chien.yen@oracle.com> Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Mukesh Kacker [Sun, 12 Feb 2017 00:42:56 +0000 (16:42 -0800)]
mlx4_core: Add func name to common error strings to locate uniquely
We add function names (and where needed line numbers) to
some repeated error strings so we can identify the failure
location uniquely for ease of debugging.
Commit "mlx4_ib: Memory leak on Dom0 with SRIOV" introduced an error,
that the CM message DREQ was silently dropped by the PF passive side,
if the disconnect happened more than 5 seconds after the RTU was
received.
Orabug 25829233 documents that there is memory leak in the mlx4 driver
when the DomUs are destroyed while active. But this patchset does not
influence this leak. The leak is tracked by orabug 25946511.
This commit is a first step to make the uek4 tunneling proxy equal to
upstream and thereafter fix bugs both places.
Commit "mlx4_ib: Memory leak on Dom0 with SRIOV" introduced an error,
that the CM message DREQ was silently dropped by the PF passive side,
if the disconnect happened more than 5 seconds after the RTU was
received.
In order to cleanly revert it, this dependent commit needs to be
reverted as well.
Orabug 25829233 documents that there is memory leak in the mlx4 driver
when the DomUs are destroyed while active. But this patchset does not
influence this leak. The leak is tracked by orabug 25946511.
Note that this commit also included a renaming of a variable. This
will be re-introduced in a later commit.
Convert to hv_utils_transport to support both netlink and /dev/vmbus/hv_vss communication methods.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 6472f80a2eeb34b442542bccd4d600e9251d9c36) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Drivers: hv: vss: switch to using the hvutil_device_state state machine
Switch to using the hvutil_device_state state machine from using kvp_transaction.active.
State transitions are:
-> HVUTIL_DEVICE_INIT when driver loads or on device release
-> HVUTIL_READY if the handshake was successful
-> HVUTIL_HOSTMSG_RECEIVED when there is a non-negotiation message from the host
-> HVUTIL_USERSPACE_REQ after we sent the message to the userspace daemon
-> HVUTIL_USERSPACE_RECV after/if the userspace daemon has replied
-> HVUTIL_READY after we respond to the host
-> HVUTIL_DEVICE_DYING on driver unload
In hv_vss_onchannelcallback() process ICMSGTYPE_NEGOTIATE messages even when
the userspace daemon is disconnected, otherwise we can make the host think
we don't support VSS and disable the service completely.
Unfortunately there is no good way we can figure out that the userspace daemon
has died (unless we start treating all timeouts as such), add a protection
against processing new VSS_OP_REGISTER messages while being in the middle of a
transaction (HVUTIL_USERSPACE_REQ or HVUTIL_USERSPACE_RECV state).
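The transitions listed above, together with the registration guard, can be written down as a small table-like check. This is one reading of the list, not the driver code: the enum is local to the sketch, timeouts are not modelled, and sketch_transition_ok() and sketch_may_register() are hypothetical helpers.

```c
#include <stdbool.h>

/* State names follow the commit message; the enum itself is a sketch. */
enum sketch_state {
    HVUTIL_DEVICE_INIT,
    HVUTIL_READY,
    HVUTIL_HOSTMSG_RECEIVED,
    HVUTIL_USERSPACE_REQ,
    HVUTIL_USERSPACE_RECV,
    HVUTIL_DEVICE_DYING
};

/* True if moving from "from" to "to" matches the listed transitions. */
static bool sketch_transition_ok(enum sketch_state from, enum sketch_state to)
{
    switch (to) {
    case HVUTIL_DEVICE_INIT:    /* driver load or device release */
    case HVUTIL_DEVICE_DYING:   /* driver unload */
        return true;
    case HVUTIL_READY:          /* handshake done, or host was answered */
        return from == HVUTIL_DEVICE_INIT || from == HVUTIL_USERSPACE_RECV;
    case HVUTIL_HOSTMSG_RECEIVED:
        return from == HVUTIL_READY;
    case HVUTIL_USERSPACE_REQ:
        return from == HVUTIL_HOSTMSG_RECEIVED;
    case HVUTIL_USERSPACE_RECV:
        return from == HVUTIL_USERSPACE_REQ;
    }
    return false;
}

/* A new registration is refused while a transaction is in flight. */
static bool sketch_may_register(enum sketch_state s)
{
    return s != HVUTIL_USERSPACE_REQ && s != HVUTIL_USERSPACE_RECV;
}
```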
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 086a6f68d6933d3c48b3898752cd6ca1a0e02aec) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Drivers: hv: vss: process deferred messages when we complete the transaction
In theory, the host is not supposed to issue any requests before we reply to
the previous one. In KVP we, however, support the following scenarios:
1) A message was received before userspace daemon registered;
2) A message was received while the previous one is still being processed.
In VSS we support only the former. Add support for the latter, using
hv_poll_channel() to do the job.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 38c06c29bada78c4805000bfb9b7f19cd691461b) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Convert to hv_utils_transport to support both netlink and /dev/vmbus/hv_kvp communication methods.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Alex Ng <alexng@microsoft.com> Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Orabug: 25819105
(cherry picked from commit 11bc3a5fa91f193b3d947a4cf51e21c4aa13292d) Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
There is nothing wrong with coalescing during defragmentation, it
reduces truesize overhead and simplifies things for the receiving
socket (no fraglist walk needed).
However, it also destroys geometry of the original fragments.
While that doesn't cause any breakage (we make sure to not exceed largest
original size) ip_do_fragment contains a 'fastpath' that takes advantage
of a present frag list and results in fragments that (in most cases)
match what was received.
In case it's needed, the coalescing could be done later, when we're sure
the skb is not forwarded. But discussion during NFWS resulted in
'lets just remove this for now'.
Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 14fe22e334623e451b5592193415c644005461ea)
Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues. To ensure that the two ESN structures are the same size,
compare both the overall size as reported by xfrm_replay_state_esn_len()
and the internal length.
Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f843ee6dd019bcece3e74e76ad9df0155655d0df) Signed-off-by: Brian Maly <brian.maly@oracle.com>
When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer. However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call. There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents. We do
not at this point check that the replay_window is within the allocated
memory. This leads to out-of-bounds reads and writes triggered by
netlink packets. This leads to memory corruption and the potential for
privilege escalation.
We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len(). This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn. It however does not check the replay_window
remains within that buffer. Add validation of the contained
replay_window.
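The checks discussed in this commit and the previous one can be sketched in userspace C. The structure is reduced to the two fields that matter and all names are hypothetical; the assumption that the bitmap length is counted in 32-bit words, so the window may not exceed bmp_len * 32 bits, is stated here rather than taken from the text.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced replay state: header plus a bitmap of words. */
struct sketch_replay_esn {
    uint32_t bmp_len;           /* bitmap length in 32-bit words (assumed) */
    uint32_t replay_window;     /* window size in bits */
};

/* Overall size of the state, header plus bitmap. */
static size_t sketch_esn_len(const struct sketch_replay_esn *esn)
{
    return sizeof(*esn) + (size_t)esn->bmp_len * sizeof(uint32_t);
}

/*
 * Validate an update "up" against the current state "cur".
 * Returns 0 if acceptable, -EINVAL otherwise.
 */
static int sketch_replay_verify(const struct sketch_replay_esn *cur,
                                const struct sketch_replay_esn *up)
{
    /* Compare the overall size and the internal length as well. */
    if (sketch_esn_len(up) != sketch_esn_len(cur) ||
        up->bmp_len != cur->bmp_len)
        return -EINVAL;

    /* The window must fit inside the allocated bitmap. */
    if ((uint64_t)up->replay_window > (uint64_t)up->bmp_len * 32u)
        return -EINVAL;

    return 0;
}
```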
Signed-off-by: Andy Whitcroft <apw@canonical.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 677e806da4d916052585301785d847c3b3e6186a) Signed-off-by: Brian Maly <brian.maly@oracle.com>
If lpfc rejects a PRLI that is sent from a target, the target will not resend
and will reject the PRLI sent from the initiator.
Signed-off-by: Dick Kennedy <dick.kennedy@broadcom.com> Signed-off-by: James Smart <james.smart@broadcom.com> Signed-off-by: Joe Jin <joe.jin@oracle.com>
Alexander Popov [Tue, 28 Feb 2017 16:54:40 +0000 (19:54 +0300)]
tty: n_hdlc: get rid of racy n_hdlc.tbuf
Currently N_HDLC line discipline uses a self-made singly linked list for
data buffers and has n_hdlc.tbuf pointer for buffer retransmitting after
an error.
The commit be10eb7589337e5defbe214dae038a53dd21add8
("tty: n_hdlc add buffer flushing") introduced racy access to n_hdlc.tbuf.
After tx error concurrent flush_tx_queue() and n_hdlc_send_frames() can put
one data buffer to tx_free_buf_list twice. That causes double free in
n_hdlc_release().
Let's use standard kernel linked list and get rid of n_hdlc.tbuf:
in case of tx error put current data buffer after the head of tx_buf_list.
Jiri Slaby [Thu, 26 Nov 2015 18:28:26 +0000 (19:28 +0100)]
TTY: n_hdlc, fix lockdep false positive
The class of 4 n_hdlc buf locks is the same because a single function
n_hdlc_buf_list_init is used to init all the locks. But since
flush_tx_queue takes n_hdlc->tx_buf_list.spinlock and then calls
n_hdlc_buf_put which takes n_hdlc->tx_free_buf_list.spinlock, lockdep
emits a warning:
=============================================
[ INFO: possible recursive locking detected ]
4.3.0-25.g91e30a7-default #1 Not tainted
---------------------------------------------
a.out/1248 is trying to acquire lock:
(&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fd020>] n_hdlc_buf_put+0x20/0x60 [n_hdlc]
but task is already holding lock:
(&(&list->spinlock)->rlock){......}, at: [<ffffffffa01fdc07>] n_hdlc_tty_ioctl+0x127/0x1d0 [n_hdlc]
other info that might help us debug this:
Possible unsafe locking scenario:
Fixes: 376c7311bdb6 ("net: add a temporary sanity check in skb_orphan()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com>
(cherry picked from commit 8b74d439e1697110c5e5c600643e823eb1dd0762) Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Paolo Abeni [Tue, 21 Feb 2017 08:33:18 +0000 (09:33 +0100)]
ip: fix IP_CHECKSUM handling
The skbs processed by ip_cmsg_recv() are not guaranteed to
be linear, e.g. when sending UDP packets over loopback with
MSG_MORE.
Using csum_partial() on [potentially] the whole skb len
is dangerous; instead be on the safe side and use skb_checksum().
Thanks to the syzkaller team for detecting the issue and providing the
reproducer.
v1 -> v2:
- move the variable declaration in a tighter scope
Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ca4ef4574f1ee5252e2cd365f8f5d5bafd048f32)
Eric Dumazet [Mon, 24 Oct 2016 01:03:06 +0000 (18:03 -0700)]
udp: fix IP_CHECKSUM handling
First bug was added in commit ad6f939ab193 ("ip: Add offset parameter to
ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before
queueing") forgot to adjust the offsets now that UDP headers are pulled
before skbs are put in the receive queue.
Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sam Kumar <samanthakumar@google.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 10df8e6152c6c400a563a673e9956320bfce1871)
Willem de Bruijn [Thu, 7 Apr 2016 22:12:59 +0000 (18:12 -0400)]
udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6dd pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.
Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checksum when reading truncated data.
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 31c2e4926fe912f88388bcaa8450fcaa8f2ece47)