www.infradead.org Git - users/jedix/linux-maple.git/log

NVMe: Set affinity after allocating request queues

The asynchronous namespace scanning caused affinity hints to be set before
the tagset was initialized, so there was no cpu mask to set the hint. This
patch moves the affinity hint setting to after namespaces are scanned.

Reported-by: 김경산 <ks0204.kim@samsung.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit bda4e0fb3126aca15586d165b5a15a37edc0a984)

Orabug: 25130845
Conflicts:
Manually patched the commit.
drivers/block/nvme-core.c

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

NVMe: Fix IO for extended metadata formats

This fixes io submit ioctl handling when using extended metadata
formats. When these formats are used, the user provides a single virtually
contiguous buffer containing both the block and metadata interleaved,
so the metadata size needs to be added to the total length and not mapped
as a separate transfer.
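
As a rough userspace illustration of the length calculation described above
(not the driver's code; the helper name and its parameters are hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of bytes the user buffer must hold for
 * nblocks logical blocks of (1 << lba_shift) bytes each. With an
 * extended metadata format, every block is followed by ms bytes of
 * metadata in the same buffer, so the metadata is added to the total
 * length instead of being mapped as a second transfer.
 */
static uint64_t io_transfer_len(uint32_t nblocks, unsigned int lba_shift,
                                uint16_t ms, bool extended)
{
    uint64_t len = (uint64_t)nblocks << lba_shift;

    if (extended)
        len += (uint64_t)nblocks * ms;
    return len;
}
```

For 8 blocks of 512 bytes with 8 bytes of metadata each, this gives 4096
bytes without and 4160 bytes with an extended format.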

The command is also driver generated, so this patch does not enforce
that blk-integrity extensions provide the metadata buffer.

Reported-by: Marcin Dziegielewski <marcin.dziegielewski@intel.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 71feb364e7faadc681e714f7fdc2bede208ba26c)

Orabug: 25130845
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

NVMe: Remove hctx reliance for multi-namespace

The driver needs to track shared tags to support multiple namespaces
that may be dynamically allocated or deleted. Relying on the first
request_queue's hctx's is not appropriate as we cannot clear outstanding
tags for all namespaces using this handle, nor can the driver easily track
all request_queue's hctx as namespaces are attached/detached. Instead,
this patch uses the nvme_dev's tagset to get the shared tag resources
rather than going through a request_queue hctx.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 42483228d4c019ffc86b8dbea7dfbc3f9566fe7e)

Orabug: 25130845
Conflicts:
nvme_set_irq_hints() needs to check tags instead of hctx and
retain nvme_admin_exit_hctx as exit_hctx
drivers/block/nvme-core.c

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

NVMe: Use requested sync command timeout

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit f4ff414aeb472397d3b4fc15c22ca65bab219ec8)

Orabug: 25130845
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>

Revert "nvme: move to a new drivers/nvme/host directory"

This reverts commit 57dacad5f2288e3de91f99b29f07b4a2793446d2. We need to
cherry-pick many commits before merging this commit. Hence this commit
is reverted to cherry-pick the commits from upstream.

Orabug: 25130845
Conflicts:
drivers/nvme/host/Kconfig

Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

Revert "NVMe: reduce admin queue depth as workaround for Samsung EPIC SQ errata"

This reverts commit ab4538cd6fb47c5a3475d0652830a1d4c8c46167.

Revert "nvme: Limit command retries"

This reverts commit 582575bf4329fa5e29c0f1eae79cb0fb13fded04.

Revert "nvme: avoid cqe corruption when update at the same time as read"

This reverts commit 4369f33dfdd50a5011922d45830e2b69ba4067ce.

Revert "NVMe: Don't unmap controller registers on reset"

This reverts commit 75502b9da27d7be3132b9eb3b7da52eae48c3556.

Revert "NVMe: reverse IO direction for VUC command code F7"

This reverts commit a9ddbd6640c88276b8cc8bea5201bf45df7ab71e.

Revert "NVMe: reduce queue depth as workaround for Samsung EPIC SQ errata"

This reverts commit 881900f628c3bedf653aa677da465ab3b8eddf31.

Merge branch 'uek4/topic/uek-4.1/xen-bug26107942' into uek/uek-next/for-chander-bug26107942

Conflicts:
arch/x86/xen/enlighten.c
drivers/net/xen-netfront.c
fs/proc/generic.c
fs/proc/internal.h

arch/x86/xen/enlighten.c had one header that is no longer conditionally
included under CONFIG_KEXEC, as introduced by commit 28a4be540b ("kexec:
allow kdump with crash_kexec_post_notifiers"); fs/proc/* required
exporting a new symbol used by commit ac7bd1728ac4 ("xenfs: Use
proc_create_mount_point() to create /proc/xen"); finally,
xen-netfront.c had already accounted for the changes introduced by
9e13456b6312 and 0d1d6389b930, hence we simply retain the topic branch
version.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

forcedeth: enable forcedeth kernel option

Orabug: 25571921

The NVIDIA forcedeth NIC is used in customer hosts, so the
forcedeth driver is needed.

Signed-off-by: yanjun.zhu@oracle.com
Reviewed-by: John Haxby <john.haxby@oracle.com>

ipmi: Edit ambiguous error message for unknown command

The IPMI SI interface issues the clear flags command irrespective
of the underlying physical interface. If the platform does
not recognize this command, it correctly responds with
unknown command (0xc1). However, the SI interface prints this
as if it were an error, which is ambiguous. An unknown command
response should only produce an info message; a warning is
appropriate only if the platform returns some other error response.
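
The intended classification can be sketched in userspace C as follows. This
is only an illustration; the enum, the macro names and the function are
hypothetical and not taken from ipmi_si_intf.c:

```c
/* Completion codes; 0xc1 is the "unknown command" response named above. */
#define CC_SUCCESS         0x00
#define CC_UNKNOWN_COMMAND 0xc1

enum log_level { LOG_NONE, LOG_INFO, LOG_WARN };

/*
 * Hypothetical helper: pick the log level for the response to the
 * clear flags command. An unknown command is expected on some
 * platforms and is only informational; any other failure is a warning.
 */
static enum log_level clear_flags_log_level(unsigned char cc)
{
    if (cc == CC_SUCCESS)
        return LOG_NONE;
    if (cc == CC_UNKNOWN_COMMAND)
        return LOG_INFO;
    return LOG_WARN;
}
```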

Edit the message to clear the ambiguity.

Orabug: 25461958

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit 0f6bfff7803dacb7ebfd765d6a2beb54e018698a)

Conflicts:

drivers/char/ipmi/ipmi_si_intf.c

kabi whitelist: Remove all ib_ symbols from the list.

The following symbols are all used by the sif driver and are the only ib_ symbols
in the current uek4 kabi whitelist:

ib_alloc_device
ib_dealloc_device
ib_dispatch_event
ib_modify_qp_is_ok
ib_rate_to_mult
ib_register_device
ib_umem_get_attrs
ib_umem_release
ib_unregister_device

Remove these symbols from the list to allow a data structure change needed to
fix bug 25723815. This change breaks the kabi in the IB area.

Orabug: 25955825

Signed-off-by: Knut Omang <knut.omang@oracle.com>

ext4: print ext4 mount option data_err=abort correctly

If data_err=abort option is specified for an ext3/ext4 mount,
/proc/mounts does show it as "(null)". This is caused by token2str()
returning NULL for Opt_data_err_abort (due to its pattern containing
'=').

Signed-off-by: Ales Novak <alnovak@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Orabug: 25691020
Acked-by: todd.vierling@oracle.com

IB/sa: Allocate SA query with kzalloc

Orabug: 26124118

Replace kmalloc with kzalloc so that all uninitialized fields in SA query
will be zeroed, avoiding unintended consequences. This prepares the
SA query structure to accept new fields in the future.
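
A userspace analogy, assuming calloc() as a stand-in for kzalloc(); the
structure and its fields are made up for illustration:

```c
#include <stdlib.h>

/* Hypothetical query structure; future_field stands for a field added later. */
struct sa_query_sketch {
    int id;
    void *future_field;
    unsigned int flags;
};

/*
 * Zeroing allocation: fields that the caller does not set explicitly
 * start out as 0/NULL, so code that later tests future_field does not
 * read leftover heap contents.
 */
static struct sa_query_sketch *alloc_query(int id)
{
    struct sa_query_sketch *q = calloc(1, sizeof(*q));

    if (!q)
        return NULL;
    q->id = id;
    return q;
}
```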

Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 5d2657708ec25b9fb3dd174443b1f647babcbe62)

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

IB/sa: Fix netlink local service GFP crash

Orabug: 26124118

The rdma netlink local service registers a handler to handle RESOLVE
response and another handler to handle SET_TIMEOUT request. The first
thing these handlers do is to call netlink_capable() to check the
access right of the received skb to make sure that the sender has root
access. Under normal conditions, such responses and requests will be
directly forwarded to the handlers without going through the netlink_dump
pathway (see ibnl_rcv_msg() in drivers/infiniband/core/netlink.c).
However, a user application could send a RESOLVE request (not response)
to the local service, which will fall into the netlink_dump pathway,
where a new skb will be created without initializing the control block.
This new skb will be eventually forwarded to the local service RESOLVE
response handler. Unfortunately, netlink_capable() will cause general
protection fault if the skb's control block is not initialized. This
patch will address the problem by checking the skb first.

Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 2deeb4772971e56d5bdac0bd3375d5eadaa827fd)

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

IB/sa: Fix rdma netlink message flags

Orabug: 26124118

The flags to ibnl_put_msg should be NLM_F_REQUEST instead of GFP_KERNEL.

Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit ba13b5f8f86efa78bc0aaea297b0001b6cbf6c21)

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

IB/sa: Put netlink request into the request list before sending

Orabug: 26124118

It was found by Saurabh Sengar that the netlink code tried to allocate
memory with GFP_KERNEL while holding a spinlock. While it is possible
to fix the issue by replacing GFP_KERNEL with GFP_ATOMIC, it is better
to get rid of the spinlock while sending the packet. However, in order
to protect against a race condition that a quick response may be received
before the request is put on the request list, we need to put the request
on the list first.
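
The ordering described above can be sketched in userspace C, with a pthread
mutex standing in for the spinlock. All names here are hypothetical; this is
not the sa_query.c code:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct request {
    int seq;
    struct request *next;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct request *pending;

/* True if r is currently on the pending list. */
static bool is_pending(struct request *r)
{
    struct request *it;
    bool found = false;

    pthread_mutex_lock(&list_lock);
    for (it = pending; it; it = it->next)
        if (it == r)
            found = true;
    pthread_mutex_unlock(&list_lock);
    return found;
}

/* Stub sender: succeeds only if the request was listed before sending. */
static int send_stub(struct request *r)
{
    return is_pending(r) ? 0 : -1;
}

/* Stub sender that always fails. */
static int send_fail(struct request *r)
{
    (void)r;
    return -1;
}

/*
 * Put the request on the list first, drop the lock, then send. The
 * send callback may block or allocate because no lock is held, and a
 * quick response can already find the request on the list. On send
 * failure the request is taken off the list again.
 */
static int submit_request(struct request *r, int (*send)(struct request *))
{
    struct request **pp;
    int ret;

    pthread_mutex_lock(&list_lock);
    r->next = pending;
    pending = r;
    pthread_mutex_unlock(&list_lock);

    ret = send(r);
    if (ret < 0) {
        pthread_mutex_lock(&list_lock);
        for (pp = &pending; *pp; pp = &(*pp)->next) {
            if (*pp == r) {
                *pp = r->next;
                break;
            }
        }
        pthread_mutex_unlock(&list_lock);
    }
    return ret;
}
```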

Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reported-by: Saurabh Sengar <saurabh.truth@gmail.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 3ebd2fd0d0119a5ac7906bf17be637b527f63d31)

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

IB/core: Fix a potential array overrun in CMA and SA agent

Orabug: 26124118

Fix an array overrun when walking the callback table.
The callback table is declared without a maximum size; the size is
provided only at registration.

A new operation may be added that the current client does not
support. If ib_netlink accepts such an operation, it will
overrun the array.
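
A minimal userspace sketch of the kind of bounds check this implies. The
types and function names are hypothetical, not the ib_netlink interface:

```c
#include <errno.h>
#include <stddef.h>

typedef int (*op_handler)(int arg);

/* Hypothetical client: the table size is recorded at registration. */
struct client_sketch {
    size_t nops;
    const op_handler *cb_table;
};

/* Trivial handler used for demonstration. */
static int echo_op(int arg)
{
    return arg;
}

/*
 * Reject operations the client did not register before indexing the
 * callback table, so a newer, unknown op cannot read past its end.
 */
static int dispatch_op(const struct client_sketch *c, size_t op, int arg)
{
    if (op >= c->nops || !c->cb_table[op])
        return -EINVAL;
    return c->cb_table[op](arg);
}
```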

Fixes: 809d5fc9bf65 ("infiniband: pass rdma_cm module to
netlink_dump_start")
Fixes: b493d91d333e ("iwcm: common code for port mapper")
Fixes: 2ca546b92a02 ("IB/sa: Route SA pathrecord query through netlink")
(Backported from commit 2fa2d4fb1166d1ef35f0aacac6165d53ab1b89c7)

Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

IB/SA: Use correct free function

Orabug: 26124118

Fixes a direct call to kfree_skb when nlmsg_free should be used.

Fixes: 2ca546b92a02 ('IB/sa: Route SA pathrecord query through netlink')
Signed-off-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 0f377d86252d11bfea941852785e3094b93601a7)

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

IB/sa: Route SA pathrecord query through netlink

Orabug: 26124118

This patch routes a SA pathrecord query to netlink first and processes the
response appropriately. If a failure is returned, the request will be sent
through IB. The decision whether to route the request to netlink first is
determined by the presence of a listener for the local service netlink
multicast group. If the user-space local service netlink multicast group
listener is not present, the request will be sent through IB, just like
what is currently being done.

Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 2ca546b92a024d07adedd15b4c262b1c2c0786ec)

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

IB/core: Add rdma netlink helper functions

Orabug: 26124118

This patch adds a function to check if listeners for a netlink multicast
group are present. It also adds a function to receive netlink response
messages.

Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit bc10ed7d3d19ff61427007b4d7bf98d3e57bb333)

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

IB/netlink: Add defines for local service requests through netlink

Orabug: 26124118

This patch adds netlink defines for local service client, local service
group, local service operations, and related attributes.

Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from commit 6431eb87065ffd24dfc7c0b6954e80a4eb74e177)

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com>

scsi: mpt3sas: remove redundant wmb

Orabug: 26096353

Due to relaxed ordering requirements on multiple architectures, drivers
are required to use wmb/rmb/mb combinations when they need to guarantee
observability between the memory and the HW.

The mpt3sas driver is already using wmb() for this purpose. However, it
issues a writel following wmb(). The writel() function on arm/arm64
architectures has an embedded wmb() call inside.

This results in unnecessary performance loss and code duplication.

writel already guarantees ordering for both cpu and bus. We don't need
an additional wmb().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Acked-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Reviewed-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b1391a5bf83a593bbe92d1f9bddaf563be5c7c9d)
Signed-off-by: Shan Hai <shan.hai@oracle.com>

scsi: mpt3sas: Updating driver version to v15.100.00.00

Orabug: 26096353

Updated driver version to "15.100.00.00"

Signed-off-by: Chaitra P B <chaitra.basappa@broadcom.com>
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 7cfa76963f1872461adff2e84edfbaa8e17d189b)
Signed-off-by: Shan Hai <shan.hai@oracle.com>

scsi: mpt3sas: Fix for Crusader to achieve product targets with SAS devices.

Orabug: 26096353

A small glitch/degraded performance in Crusader with SAS drives is
improved by removing unnecessary spinlocks while clearing the scsi
command in the driver's internal lookup table.

Signed-off-by: Chaitra P B <chaitra.basappa@broadcom.com>
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 459325c466d278d3c9f51ddc9bb544c014136fd1)
Signed-off-by: Shan Hai <shan.hai@oracle.com>

scsi: mpt3sas: Fix Firmware fault state 0x2100 during heavy 4K RR FIO stress test.

Orabug: 26096353

Due to a loop in the IO path, the HBA receives heavy IO while the
driver does not update the Reply Post Host Index frequently. There is
therefore a high chance that the firmware is unable to find a free
entry in the Reply Post Descriptor Queue (i.e. queue overflow occurs),
and a 0x2100 firmware fault can be observed. To fix this, we have defined
a threshold value. After continuously processing this threshold
number of reply descriptors, the driver updates the Reply Descriptor Host
Index, so that this many reply descriptor entries are freed and become
available to the firmware and the firmware fault is no longer
observed. We have defined this threshold value as
1/3rd of the hba queue depth.
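
The idea can be sketched in plain C as below. This is a simplified model
under stated assumptions, not the mpt3sas code; the structure and field names
are hypothetical, and writing host_index stands in for the register write:

```c
/* Hypothetical model of one reply queue. */
struct reply_queue_sketch {
    unsigned int depth;        /* hba queue depth */
    unsigned int host_index;   /* last index published to the firmware */
    unsigned int index_writes; /* number of host index updates */
};

static void publish_index(struct reply_queue_sketch *q, unsigned int idx)
{
    q->host_index = idx;
    q->index_writes++;
}

/*
 * Process nreplies descriptors. Instead of publishing the host index
 * only once at the end, publish it every depth / 3 descriptors so the
 * firmware gets entries back while the loop is still running.
 */
static unsigned int process_replies(struct reply_queue_sketch *q,
                                    unsigned int nreplies)
{
    unsigned int threshold = q->depth / 3;
    unsigned int idx = q->host_index;
    unsigned int since_update = 0;
    unsigned int i;

    if (threshold == 0)
        threshold = 1;

    for (i = 0; i < nreplies; i++) {
        idx = (idx + 1) % q->depth;
        if (++since_update >= threshold) {
            publish_index(q, idx);
            since_update = 0;
        }
    }
    if (since_update)
        publish_index(q, idx);
    return q->index_writes;
}
```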

Signed-off-by: Chaitra P B <chaitra.basappa@broadcom.com>
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 6b4c335a0f6cc61c69cd24f24e40b118bd9f778a)
Signed-off-by: Shan Hai <shan.hai@oracle.com>

scsi: mpt3sas: Added print to notify cable running at a degraded speed.

Orabug: 26096353

Driver processes the event MPI26_EVENT_ACTIVE_CABLE_DEGRADED when a
cable is present and is running at a degraded speed (below the SAS3 12
Gb/s rate). Prints added to inform the user that the cable is not
running at optimal speed.

Signed-off-by: Chaitra P B <chaitra.basappa@broadcom.com>
Signed-off-by: Suganath Prabu S <suganath-prabu.subramani@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 6c44c0fe91af7bac78dcaf4c106421862530f499)
Signed-off-by: Shan Hai <shan.hai@oracle.com>

xen-blkback: report hotplug-status busy when detach is initiated but frontend device is busy.

In case of a deferred detach, xm/xend doesn't get notified about the busy
status and has to wait for a timeout (default 100s) before reporting the
detach failure to the user. This behavior is sometimes incorrectly
interpreted as a tool hang.

This patch updates the hotplug-status with busy so that xm gets notified
instead of timing out.

Orabug: 26072430
Signed-off-by: Niranjan Patil <niranjan.d.patil@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

qla2xxx: Allow vref count to timeout on vport delete.

This commit fixes a panic that could be triggered with the following steps:
1.create vhba
  #virsh nodedev-create  vhba.xml
2.destroy vhba
  #virsh nodedev-destroy scsi_host9

Content of file vhba.xml:

<device>
     <parent>scsi_host7</parent>
     <capability type='scsi_host'>
       <capability type='fc_host'>
       </capability>
     </capability>
</device>

Call trace of panic:

[  207.683754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000410
[  207.683805] IP: [<ffffffffa0221d0f>] qla24xx_vport_delete+0xdf/0x180 [qla2xxx]
[  207.683850] PGD 0
[  207.683863] Oops: 0000 [#1] SMP
[  207.684391] CPU: 0 PID: 2029 Comm: libvirtd Not tainted 4.1.12-94.2.1.el7uek.x86_64 #2
[  207.684418] Hardware name: Oracle Corporation ORACLE SERVER X5-2/ASM,MOTHERBOARD,1U, BIOS 30100400 12/26/2016
[  207.684454] task: ffff88026fc31c00 ti: ffff88007278c000 task.ti: ffff88007278c000
[  207.684491] RIP: 0010:[<ffffffffa0221d0f>]  [<ffffffffa0221d0f>] qla24xx_vport_delete+0xdf/0x180 [qla2xxx]
[  207.684535] RSP: 0018:ffff88007278fcf8  EFLAGS: 00010202
[  207.684555] RAX: 0000000000000001 RBX: ffff8802729c17f8 RCX: ffffffffa0258e80
[  207.684578] RDX: 0000000000007086 RSI: 0000000000000000 RDI: ffff88026fef0360
[  207.684601] RBP: ffff88007278fd18 R08: 0000000000000001 R09: ffff88027741ad80
[  207.684625] R10: ffffea0009bbbc00 R11: 0000000000000000 R12: ffff88026fef0000
[  207.684649] R13: 0000000000000001 R14: ffff88026fef0360 R15: 0000000000000021
[  207.684673] FS:  00007f24f538d700(0000) GS:ffff880277400000(0000) knlGS:0000000000000000
[  207.684699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  207.684719] CR2: 0000000000000410 CR3: 00000002731e7000 CR4: 00000000001406f0
[  207.684744] Stack:
[  207.684755]  ffff8802733dd800 ffff8802728fd000 ffff880474d3f000 ffff880474d3f648
[  207.684785]  ffff88007278fd58 ffffffffa000d344 ffff8804714f4000 ffff8802728fd000
[  207.684814]  ffff8802728fd000 ffff8802733dd800 0000000000000021 ffff880474d3f648
[  207.684861] Call Trace:
[  207.684897]  [<ffffffffa000d344>] fc_vport_terminate+0x44/0x150 [scsi_transport_fc]
[  207.684927]  [<ffffffffa000d594>] store_fc_host_vport_delete+0x144/0x180 [scsi_transport_fc]
[  207.684959]  [<ffffffff81489798>] dev_attr_store+0x18/0x30
[  207.684996]  [<ffffffff81294fbd>] sysfs_kf_write+0x3d/0x50
[  207.685017]  [<ffffffff8129446a>] kernfs_fop_write+0x12a/0x180
[  207.685040]  [<ffffffff812129b7>] __vfs_write+0x37/0x120
[  207.685061]  [<ffffffff812158d8>] ? __sb_start_write+0x58/0x110
[  207.685084]  [<ffffffff812c1743>] ? security_file_permission+0x23/0xa0
[  207.685107]  [<ffffffff812130f9>] vfs_write+0xa9/0x1b0
[  207.685128]  [<ffffffff81736c16>] ? mutex_lock+0x16/0x37
[  207.685147]  [<ffffffff81213fe5>] SyS_write+0x55/0xd0
[  207.685179]  [<ffffffff81738c6e>] system_call_fastpath+0x12/0x71
[  207.685200] Code: 07 00 00 01 0f b7 83 b0 01 00 00 f0 49 0f b3 84 24 30 07 00 00 4c 89 f7 e8 5f 4d 51 e1 48 8b b3 b8 01 00 00 0f b7 83 b0 01 00 00 <66> 39 86 10 04 00 00 74 68 45 0f b7 c5 48 89 de 31 c0 48 c7 c1
[  207.686827] RIP  [<ffffffffa0221d0f>] qla24xx_vport_delete+0xdf/0x180 [qla2xxx]
[  207.687634]  RSP <ffff88007278fcf8>
[  207.688391] CR2: 0000000000000410

Cc: <stable@vger.kernel.org>
Signed-off-by: Joe Carnuccio <joe.carnuccio@cavium.com>
Signed-off-by: Himanshu Madhani <himanshu.madhani@cavium.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
(cherry picked from commit c4a9b538ab2a109c5f9798bea1f8f4bf93aadfb9)

Orabug: 26021151

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Conflicts:
drivers/scsi/qla2xxx/qla_attr.c
drivers/scsi/qla2xxx/qla_def.h
drivers/scsi/qla2xxx/qla_os.c

Btrfs: don't BUG_ON() in btrfs_orphan_add

Orabug: 25975316

This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit 3b6571c180da85e43550c608e954ab7b2a31d954)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>

Btrfs: clarify do_chunk_alloc()'s return value

Orabug: 25975316

Function start_transaction() can return ERR_PTR(1) when flush is
BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is

start_transaction (return ERR_PTR(1))
  -> btrfs_block_rsv_add (return 1)
     -> reserve_metadata_bytes (return 1)
        -> flush_space (return 1)
           -> do_chunk_alloc  (return 1)

With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
flush_state of ALLOC_CHUNK and it successfully allocates a new
chunk, then instead of trying to reserve space again,
reserve_metadata_bytes returns 1 immediately.

Eventually the callers who call start_transaction() usually just
do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get
a panic when dereferencing a pointer which is ERR_PTR(1).
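
Why ERR_PTR(1) slips through can be seen with a userspace imitation of the
error-pointer helpers. This is a sketch modeled on the kernel's convention
that only the last MAX_ERRNO values of the address space encode errors:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode an error code in a pointer value. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/*
 * Only values in [-MAX_ERRNO, -1], viewed as unsigned, count as errors.
 * A positive value such as 1 is far outside that range, so it looks
 * like a valid (but bogus) pointer.
 */
static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

With these definitions, IS_ERR(ERR_PTR(-28)) is true while
IS_ERR(ERR_PTR(1)) is false.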

The following patch fixes the above problem.
"btrfs: flush_space: treat return value of do_chunk_alloc properly"
https://patchwork.kernel.org/patch/7778651/

This adds comments to clarify do_chunk_alloc()'s return value.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 28b737f6ede3661fe610937706c4a6f50e9ab769)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>

btrfs: flush_space: treat return value of do_chunk_alloc properly

Orabug: 25975316

do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depends how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:

int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
	block_rsv_add_bytes(block_rsv, num_bytes, 0);
	return 0;
}

return ret;

So it will return -ENOSPC.
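
The normalization that the description calls for can be sketched as follows.
This is a userspace illustration with hypothetical stand-in functions, not
the actual btrfs change:

```c
#include <errno.h>

/*
 * Stand-in for do_chunk_alloc(): the caller chooses the outcome.
 * 1 means a new chunk was allocated, 0 means nothing was needed,
 * a negative value is an error.
 */
static int do_chunk_alloc_sketch(int outcome)
{
    return outcome;
}

/*
 * Stand-in for the chunk allocation step of flush_space(): a positive
 * return from the allocator means success and is converted to 0, so
 * callers that test "if (!ret)" do not mistake it for a failure.
 */
static int flush_space_alloc_chunk(int outcome)
{
    int ret = do_chunk_alloc_sketch(outcome);

    if (ret > 0)
        ret = 0;
    return ret;
}
```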

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit eecba891d38051ebf7f4af6394d188a5fd151a6a)
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>

ipv6: Skip XFRM lookup if dst_entry in socket cache is valid

Orabug: 25955089

At present we perform an xfrm_lookup() for each UDPv6 message we
send. The lookup involves querying the flow cache (flow_cache_lookup)
and, in case of a cache miss, creating an XFRM bundle.

If we miss the flow cache, we can end up creating a new bundle and
deriving the path MTU (xfrm_init_pmtu) from an already transformed
dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down
to xfrm_lookup(). This can happen only if we're caching the dst_entry
in the socket, that is when we're using a connected UDP socket.

To put it another way, the path MTU shrinks each time we miss the flow
cache, which later on leads to incorrectly fragmented payload. It can
be observed with ESPv6 in transport mode:

  1) Set up a transformation and lower the MTU to trigger fragmentation
    # ip xfrm policy add dir out src ::1 dst ::1 \
      tmpl src ::1 dst ::1 proto esp spi 1
    # ip xfrm state add src ::1 dst ::1 \
      proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
    # ip link set dev lo mtu 1500

  2) Monitor the packet flow and set up a UDP sink
    # tcpdump -ni lo -ttt &
    # socat udp6-listen:12345,fork /dev/null &

  3) Send a datagram that needs fragmentation with a connected socket
    # perl -e 'print "@" x 1470' | socat - udp6:[::1]:12345
    2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error
    00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448
    00:00:00.000014 IP6 ::1 > ::1: frag (1448|32)
    00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272
    (^ ICMPv6 Parameter Problem)
    00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136

  4) Compare it to a non-connected socket
    # perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345
    00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448
    00:00:00.000010 IP6 ::1 > ::1: frag (1448|64)

What happens in step (3) is:

  1) when connecting the socket in __ip6_datagram_connect(), we
     perform an XFRM lookup, miss the flow cache, create an XFRM
     bundle, and cache the destination,

  2) afterwards, when sending the datagram, we perform an XFRM lookup,
     again, miss the flow cache (due to mismatch of flowi6_iif and
     flowi6_oif, which is an issue of its own), and recreate an XFRM
     bundle based on the cached (and already transformed) destination.

To prevent the recreation of an XFRM bundle, avoid an XFRM lookup
altogether whenever we already have a destination entry cached in the
socket. This prevents the path MTU shrinkage and brings us on par with
UDPv4.

The fix also benefits connected PINGv6 sockets, another user of
ip6_sk_dst_lookup_flow(), who also suffer messages being transformed
twice.

Joint work with Hannes Frederic Sowa.

Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 00bc0ef5880dc7b82f9c320dead4afaad48e47be)
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Conflicts:
net/ipv6/ip6_output.c

xen: Make VPMU init message look less scary

The default for the Xen hypervisor is to not enable VPMU in order to
avoid security issues. In this case the Linux kernel will issue the
message "Could not initialize VPMU for cpu 0, error -95" which looks
more like an error than a normal state.

Change the message to something less scary in case the hypervisor
returns EOPNOTSUPP or ENOSYS when trying to activate VPMU.
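
A small userspace sketch of that distinction. The function name is
hypothetical and the real message text is not reproduced here:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper: true if a VPMU init failure just means that the
 * hypervisor has VPMU disabled (the default), so it can be reported as
 * a plain informational message rather than as an error.
 */
static bool vpmu_error_is_expected(int err)
{
    return err == -EOPNOTSUPP || err == -ENOSYS;
}
```

On Linux, EOPNOTSUPP is 95, which matches the "error -95" in the message
quoted above.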

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Orabug: 25873416

(cherry picked from commit 0252937a87e1d46a8261da85cbd99dffe612a2d3)
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Dhaval Giani <dhaval.giani@gmail.com>

uek-rpm: configs: enable CONFIG_ACPI_NFIT

Orabug: 25719149
Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>

ipv6: Don't use ufo handling on later transformed packets

Similar to commit c146066ab802 ("ipv4: Don't use ufo handling on later
transformed packets"), don't perform UFO on packets that will be IPsec
transformed. To detect it we rely on the fact that headerlen in
dst_entry is non-zero only for transformation bundles (xfrm_dst
objects).

Unwanted segmentation can be observed with a NETIF_F_UFO capable device,
such as a dummy device:

  DEV=dum0 LEN=1493

  ip li add $DEV type dummy
  ip addr add fc00::1/64 dev $DEV nodad
  ip link set $DEV up
  ip xfrm policy add dir out src fc00::1 dst fc00::2 \
     tmpl src fc00::1 dst fc00::2 proto esp spi 1
  ip xfrm state add src fc00::1 dst fc00::2 \
     proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b

  tcpdump -n -nn -i $DEV -t &
  socat /dev/zero,readbytes=$LEN udp6:[fc00::2]:$LEN

tcpdump output before:

  IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
  IP6 fc00::1 > fc00::2: frag (1448|48)
  IP6 fc00::1 > fc00::2: ESP(spi=0x00000001,seq=0x2), length 88

... and after:

  IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
  IP6 fc00::1 > fc00::2: frag (1448|80)

Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach")
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f89c56ce710afa65e1b2ead555b52c4807f34ff7)

Orabug: 25533743
Signed-off-by: Todd Vierling <todd.vierling@oracle.com>

net/packet: fix overflow in check for tp_reserve

Orabug: 25813773
CVE: CVE-2017-7308

When calculating po->tp_hdrlen + po->tp_reserve the result can overflow.

Fix by checking that tp_reserve <= INT_MAX on assign.
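
In userspace terms the check amounts to the following sketch. The setter
name is hypothetical and this is not the af_packet.c code:

```c
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical setter: refuse values above INT_MAX when the option is
 * assigned, so that later sums such as header length + reserve stay
 * within range.
 */
static int set_tp_reserve(unsigned int *tp_reserve, unsigned int val)
{
    if (val > INT_MAX)
        return -EINVAL;
    *tp_reserve = val;
    return 0;
}
```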

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit bcc5364bdcfe131e6379363f089e7b4108d35b70)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/packet: fix overflow in check for tp_frame_nr

Orabug: 25813773
CVE: CVE-2017-7308

When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow.

Add a check that tp_block_size * tp_block_nr <= UINT_MAX.

Since frames_per_block <= tp_block_size, the expression would
never overflow.
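
One common way to test that a product of two unsigned ints fits in UINT_MAX
without computing the product is to divide first. A sketch with a
hypothetical function name, not the kernel code:

```c
#include <errno.h>
#include <limits.h>

/*
 * Returns 0 if block_size * block_nr <= UINT_MAX, -EINVAL otherwise.
 * The division avoids performing the multiplication that might wrap.
 */
static int check_ring_size(unsigned int block_size, unsigned int block_nr)
{
    if (block_size == 0)
        return -EINVAL;
    if (block_nr > UINT_MAX / block_size)
        return -EINVAL;
    return 0;
}
```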

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8f8d28e4d6d815a391285e121c3a53a0b6cb9e7b)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

net/packet: fix overflow in check for priv area size

Orabug: 25813773
CVE: CVE-2017-7308

Subtracting tp_sizeof_priv from tp_block_size and casting to int
to check whether one is less than the other doesn't always work
(both of them are unsigned ints).

Compare them as is instead.

Also cast tp_sizeof_priv to u64 before using BLK_PLUS_PRIV, as
it can overflow inside BLK_PLUS_PRIV otherwise.
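
The failure mode can be reproduced in userspace. The sketch below is a simplification under stated assumptions: SKETCH_HDR_LEN is a hypothetical block header length, align8() stands in for the rounding done inside BLK_PLUS_PRIV, and the old check is reduced to the subtract-and-cast pattern described above.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(unsigned int) == 4, "sketch assumes 32-bit unsigned int");

#define SKETCH_HDR_LEN 48u	/* hypothetical block header length */

/* Round up to a multiple of 8 in 64-bit arithmetic, so it cannot wrap. */
static uint64_t align8(uint64_t v)
{
	return (v + 7) & ~(uint64_t)7;
}

/* Old pattern: subtract two unsigned values, then cast the result to int. */
int old_priv_check_ok(unsigned int block_size, unsigned int sizeof_priv)
{
	return (int)(block_size - sizeof_priv) > 0;
}

/* New pattern: widen first, then compare the values as they are. */
int new_priv_check_ok(unsigned int block_size, unsigned int sizeof_priv)
{
	uint64_t needed = SKETCH_HDR_LEN + align8((uint64_t)sizeof_priv);

	return (uint64_t)block_size > needed;
}
```

With block_size = 4096 and sizeof_priv = 0x80001001, the unsigned difference wraps to 0x7fffffff, which is positive as an int, so the old pattern accepts a private area far larger than the block.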

Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2b6867c2ce76c596676bec7d2d525af525fdc6e2)
Signed-off-by: Brian Maly <brian.maly@oracle.com>

fs/file.c: __fget() and dup2() atomicity rules

__fget() does lockless fetch of pointer from the descriptor
table, attempts to grab a reference and treats "it was already
zero" as "it's already gone from the table, we just hadn't
seen the store, let's fail".  Unfortunately, that breaks the
atomicity of dup2() - __fget() might see the old pointer,
notice that it's been already dropped and treat that as
"it's closed".  What we should be getting is either the
old file or new one, depending whether we come before or after
dup2().
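
One way to get that behavior is to re-read the table slot whenever taking the reference fails, instead of reporting the descriptor as closed. The userspace sketch below uses C11 atomics; file_sketch, get_unless_zero() and fget_sketch() are hypothetical names, and the sketch ignores memory reclamation, which the kernel handles with RCU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct file_sketch {
	atomic_long count;	/* reference count; 0 means being freed */
};

/* Take a reference unless the count has already dropped to zero. */
static int get_unless_zero(struct file_sketch *f)
{
	long c = atomic_load(&f->count);

	while (c != 0) {
		if (atomic_compare_exchange_weak(&f->count, &c, c + 1))
			return 1;
	}
	return 0;
}

/*
 * Look up one descriptor-table slot. A NULL slot means the descriptor is
 * really closed. A failed reference grab means the slot is being replaced
 * (as dup2() does), so read it again rather than fail.
 */
struct file_sketch *fget_sketch(_Atomic(struct file_sketch *) *slot)
{
	assert(slot != NULL);
	for (;;) {
		struct file_sketch *f = atomic_load(slot);

		if (f == NULL)
			return NULL;
		if (get_unless_zero(f))
			return f;
		/* stale pointer seen: loop and pick up the new one */
	}
}
```

The loop terminates only because the writer stores the new pointer before dropping the old reference; a slot left pointing at a zero-count file would spin forever in this sketch.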

Dmitry had the following test failing sometimes:

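#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int fd;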
void *Thread(void *x) {
  char buf;
  int n = read(fd, &buf, 1);
  if (n != 1)
    exit(printf("read failed: n=%d errno=%d\n", n, errno));
  return 0;
}

int main()
{
  fd = open("/dev/urandom", O_RDONLY);
  int fd2 = open("/dev/urandom", O_RDONLY);
  if (fd == -1 || fd2 == -1)
    exit(printf("open failed\n"));
  pthread_t th;
  pthread_create(&th, 0, Thread, 0);
  if (dup2(fd2, fd) == -1)
    exit(printf("dup2 failed\n"));
  pthread_join(th, 0);
  if (close(fd) == -1)
    exit(printf("close failed\n"));
  if (close(fd2) == -1)
    exit(printf("close failed\n"));
  printf("DONE\n");
  return 0;
}

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Orabug: 25408921
From 25408921:
Signed-off-by: todd.vierling@oracle.com
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>

IB/ipoib: add get_settings in ethtool

In order to let the bonding driver report the correct speed
of the underlying interfaces when they are IPoIB, the ethtool
function get_settings() is implemented in the IPoIB driver.

Orabug: 25048521

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>

RDS/IB: active bonding port state fix for intfs added late

When new interfaces are added after boot, or late notifier
events cause an interface to be added late, we need
to make sure the port state moves to UP or DOWN (and does not
stay in INIT state) regardless of the order in which data structure
initialization races with NETDEV notifier events.

Without that, subsequent failover/failback processing may
not happen properly, as it looks for port_state in the
UP or DOWN state.

Orabug: 26081079

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>

Revert "xen/events: remove unnecessary call to bind_evtchn_to_cpu()"

This reverts commit 4201cdbd6cde19a69b862984ef674ce667d526e1.

A customer running vcpu hot plug/remove in a loop could trigger a soft lockup.
The backtrace is similar in every test; one of the traces is shown below:

NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [modprobe:3686]
CPU: 1 PID: 3686 Comm: modprobe Not tainted 4.1.12-94.1.8.el6uek.x86_64 #2
task: ffff88016be32a00 ti: ffff8800066b0000 task.ti: ffff8800066b0000
RIP: e030:[<ffffffff81108230>]  [<ffffffff81108230>] smp_call_function_many+0x210/0x260
Call Trace:
  [<ffffffff8106ef20>] ? __cpa_flush_all+0x50/0x50
  [<ffffffffa02b9000>] ? 0xffffffffa02b9000
  [<ffffffff81108482>] smp_call_function+0x22/0x30
  [<ffffffff811084eb>] on_each_cpu+0x2b/0x70
  [<ffffffffa02b9fff>] ? 0xffffffffa02b9fff
  [<ffffffffa02b9000>] ? 0xffffffffa02b9000
  [<ffffffff810710cc>] change_page_attr_set_clr+0x3fc/0x530
  [<ffffffff8133d011>] ? list_del+0x11/0x40
  [<ffffffff8134b4f9>] ? ddebug_table_free+0x29/0x30
  [<ffffffff81071313>] set_memory_x+0x43/0x50
  [<ffffffff810e6e40>] ? wait_rcu_gp+0x60/0x60
  [<ffffffff811090c8>] set_page_attributes+0x28/0x30
  [<ffffffff81109168>] unset_module_init_ro_nx+0x38/0x70
  [<ffffffff8110c7ce>] free_module+0xae/0x150
  [<ffffffff8110ca1f>] do_init_module+0x1af/0x200
  [<ffffffff8110f361>] load_module+0x5b1/0x740
  [<ffffffff8110c200>] ? mod_sysfs_teardown+0x150/0x150
  [<ffffffff811cd6c2>] ? __vmalloc+0x22/0x30
  [<ffffffff8110baf0>] ? module_sect_show+0x30/0x30
  [<ffffffff8110f674>] SyS_init_module+0x94/0xc0
  [<ffffffff816e4dae>] system_call_fastpath+0x12/0x71

The root cause is not yet clear, but reverting the above commit helps.

Oracle-Bug: 25997062

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>

xsigo: Compute node crash on FC failover

Orabug: 25981973

xsvhba's internally generated scsi command timeout code prematurely completes
a command rather than relying on qlogic to complete it with the "CMD_TIMEOUT"
code. The actual command then completes just after the xsigo timeout completion
and causes the freed buffer to be overwritten with inquiry data.
These code changes allow the scsi mid layer to do the recovery.
The original xsigo timeout code is not present in the ESX xsvhba source code
and was mistakenly brought over into uek.

Reviewed-by: sajid.zia@oracle.com
Signed-off-by: Pradeep Gopanapalli <pradeep.gopanapalli@oracle.com>
Signed-off-by: said.zia@oracle.com

Revert "[SCSI] libiscsi: Reduce locking contention in fast path"

Orabug: 25975223

This reverts commit 659743b02c411075b26601725947b21df0bb29c8.

Conflicts:
drivers/scsi/be2iscsi/be_main.c
drivers/scsi/bnx2i/bnx2i_iscsi.c
drivers/scsi/iscsi_tcp.c
drivers/scsi/libiscsi.c

659743b02c41 splits iscsi session lock into two locks, one to be used while
sending a request to the target and the other to be used while processing
a response. This patch has caused multiple bugs due to races while
accessing various lists that hold the iscsi_task in the iscsi_conn
structure.

Although commit 6f8830f5bbab in upstream partially fixes the issue, there
is still at least one regression seen when the same iscsi task is accessed
simultaneously in iscsi_xmit_task() and iscsi_complete_task(), which causes
a null pointer dereference and panic.

It's best to revert this patch until we find a permanent solution.

Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: John Sobecki <john.sobecki@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

nfsd: stricter decoding of write-like NFSv2/v3 ops

Orabug: 25974739
CVE: CVE-2017-7895

The NFSv2/v3 code does not systematically check whether we decode past
the end of the buffer. This generally appears to be harmless, but there
are a few places where we do arithmetic on the pointers involved and
don't account for the possibility that a length could be negative. Add
checks to catch these.
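
A bounds check of this kind can be sketched as follows. xdr_len_ok() is a hypothetical helper, not a function from the nfsd code; it only illustrates comparing a decoded length against the space that is actually left.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if len bytes starting at p lie entirely inside [p, end) and
 * 0 otherwise. The length is compared against the remaining space, so
 * p + len is never computed and cannot point past the buffer.
 */
int xdr_len_ok(const unsigned char *p, const unsigned char *end, uint32_t len)
{
	assert(p != NULL && end != NULL);
	if (p > end)
		return 0;	/* decoding already ran past the end */
	return (size_t)(end - p) >= len;
}
```

Checking p > end first matters: without it, end - p would be negative and the conversion to an unsigned type would turn it into a very large "remaining" size.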

Reported-by: Tuomas Haanpää <thaan@synopsys.com>
Reported-by: Ari Kauppi <ari@synopsys.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit 13bf9fbff0e5e099e2b6f003a0ab8ae145436309)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: John Haxby <john.haxby@oracle.com>
Conflicts:
fs/nfsd/nfsxdr.c

sched/rt: Minimize rq->lock contention in do_sched_rt_period_timer()

With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially
takes each CPU's rq->lock. On a large, busy system, the cumulative time it
takes to acquire each lock can be excessive, even triggering a watchdog
timeout.

If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does
nothing while holding the lock, so don't bother taking it at all.
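
The shape of the optimization is sketched below in userspace C. The struct and function names are hypothetical, the rq->lock is modelled by a counter rather than a real lock, and the work done under the lock is elided.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU rt_rq state named above. */
struct rt_rq_sketch {
	unsigned long long rt_time;
	unsigned int rt_nr_running;
	unsigned int times_locked;	/* counts simulated rq->lock acquisitions */
};

/*
 * Walk all queues, but only "take the lock" for queues that have
 * accumulated runtime or runnable tasks. Returns how many queues
 * were locked.
 */
unsigned int period_timer_sketch(struct rt_rq_sketch *rqs, size_t n)
{
	unsigned int locked = 0;
	size_t i;

	assert(rqs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (rqs[i].rt_time == 0 && rqs[i].rt_nr_running == 0)
			continue;	/* nothing to do: skip the lock */
		rqs[i].times_locked++;	/* lock, do the work, unlock */
		locked++;
	}
	return locked;
}
```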

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/a767637b-df85-912f-ba69-c90ee00a3fb6@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 25491970

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

sparc64: cache_line_size() returns larger value for cache line size.

SPARC currently returns L1 data cache line size (as low as 32 bytes on
some systems) though L2 and L3 cache line sizes may be higher. As
cache_line_size() is used by code to align memory requests to prevent
unnecessary cache line sharing, this patch returns the max of L2 and L3
sizes, currently 64 bytes.

OraBug: 26045057

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

sparc64: fix inconsistent printing of handles in debug messages

Most debug messages print handles using "%llx" but some use "%llu". Use
"%llx" for all debug messages that print handles.

Signed-off-by: Menno Lageman <menno.lageman@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

sparc64: set the ISCNTRLD bit for SP service handles

Service handles generated by the ds driver can collide with service handles
generated by the SP, causing failures with Domain Services on the SP such
as 'ldom_req_sp_token: set-token failed: no reply' errors.

Ensure that service handles generated by the ds driver do not collide
with service handles generated by the SP by setting the ISCNTRLD bit in
the lower half of the service handle for SP Domain Services. This is
similar to what Solaris does.

Orabug: 25983868

Signed-off-by: Menno Lageman <menno.lageman@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Aaron Young <aaron.young@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

sparc64: DAX recursive lock removed

At some point in the past, the call to get_user_pages() was changed to
get_user_pages_fast(). The former requires that mmap_sem be held when
making the call, which the driver respected. But the latter requires that
mmap_sem not be held, since it acquires it later. So mmap_sem was being
acquired by the driver, then again in get_user_pages_fast().  In between
these two acquisitions, another thread can come along and call mmap(),
which will wait on the same semaphore, and deadlock with the subsequent
get_user_pages_fast() attempt to get it again.

  Thread 1                        Thread 2
  --------                        --------
  acquire mmap_sem                   .
  call get_user_pages_fast()         .
     .                            mmap()
     .                              acquire mmap_sem (blocks)
     acquire mmap_sem (blocks)

Since get_user_pages_fast() acquires mmap_sem, the dax driver should
not do so.

Orabug: 26103487

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Reviewed-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

sparc/ftrace: Fix ftrace graph time measurement

The ftrace function_graph time measurements of a given function do not
match those recorded by ftrace using the function
filters.  This change pulls the x86_64 fix from 'commit 722b3c746953
("ftrace/graph: Trace function entry before updating index")' into the
sparc-specific prepare_ftrace_return, which stops ftrace from
counting interrupted tasks in the time measurement.

Example measurements for select_task_rq_fair running "hackbench 100
process 1000":

             |  tracing/trace_stat/function0  |  function_graph
Before patch |  2.802 us                      |  4.255 us
After patch  |  2.749 us                      |  3.094 us

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(Cherry picked from commit 48078d2dac0a26f84f5f3ec704f24f7c832cce14)

Note: Upstream fix needed an extra parameter of NULL for
prepare_ftrace_return.

Orabug: 25995351

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

sparc64: Increase max_phys_bits to 51 for M8.

On M8 chips, use a max_phys_bits value of 51 and 54 bits for
virtual address.

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

sparc64: 5-Level page table support for sparc

Extend the page table to 5 levels for sparc.

Orabug: 26076110
Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

mm, gup: fix typo in gup_p4d_range()

gup_p4d_range() should call gup_pud_range(), not itself.

[ This was not noticed on x86: this is the HAVE_GENERIC_RCU_GUP code
used by arm[64] and powerpc - Linus ]

Fixes: c2febafc6773 ("mm: convert generic code to 5-level paging")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reported-by: Anton Blanchard <anton@samba.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ce70df089143c49385b4f32f39d41fb50fbf6a7c)

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

mm: introduce __p4d_alloc()

For full 5-level paging we need a helper to allocate p4d page table.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 90eceff1a375f6ffa78caf8654e787c0a8a591ef)

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

mm: convert generic code to 5-level paging

Convert all non-architecture-specific code to 5-level paging.

It's a mostly mechanical change that adds handling of one more page table
level in places where we deal with pud_t.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c2febafc67734a62196c1b9dfba926412d4077ba)

Conflicts:

include/linux/kasan.h
mm/kasan/kasan_init.c
mm/memory.c
mm/page_vma_mapped.c

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

asm-generic: introduce <asm-generic/pgtable-nop4d.h>

Like pgtable-nopud.h for 4-level paging, this new header is the base
for converting an architecture to a properly folded p4d_t level.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 048456dcf2c56ad6f6248e2899dda92fb6a613f6)

Conflicts:

include/asm-generic/tlb.h

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

arch, mm: convert all architectures to use 5level-fixup.h

If an architecture uses 4level-fixup.h we don't need to do anything as
it includes 5level-fixup.h.

If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK
before inclusion of the header. It makes asm-generic code use
5level-fixup.h.

If an architecture has 4-level paging or folds levels on its own,
include 5level-fixup.h directly.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 9849a5697d3defb2087cb6b9be5573a142697889)

Conflicts:

arch/arc/include/asm/hugepage.h
arch/h8300/include/asm/pgtable.h
arch/mips/include/asm/pgtable-64.h
arch/powerpc/include/asm/book3s/32/pgtable.h
arch/powerpc/include/asm/book3s/64/pgtable.h
arch/powerpc/include/asm/nohash/32/pgtable.h
arch/powerpc/include/asm/nohash/64/pgtable-4k.h
arch/powerpc/include/asm/nohash/64/pgtable-64k.h

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

asm-generic: introduce __ARCH_USE_5LEVEL_HACK

We are going to introduce <asm-generic/pgtable-nop4d.h> to provide an
abstraction for a properly (as opposed to the 5level-fixup.h hack) folded
p4d level. The new header will be included from pgtable-nopud.h.

If an architecture uses <asm-generic/nop*d.h>, we cannot use
5level-fixup.h directly to quickly convert the architecture to 5-level
paging as it would conflict with pgtable-nop4d.h.

With this patch an architecture can define __ARCH_USE_5LEVEL_HACK before
including <asm-generic/nop*d.h> to use 5level-fixup.h.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 30ec842660bd0d056d4a7028ac5bd4a82b113d4f)

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

asm-generic: introduce 5level-fixup.h

We are going to switch core MM to the 5-level paging abstraction.

This is a preparation step which adds <asm-generic/5level-fixup.h>.
As with 4level-fixup.h, the new header allows us to quickly make all
architectures compatible with 5-level paging in core MM.

In the long run we would like to switch architectures to a properly folded
p4d level by using <asm-generic/pgtable-nop4d.h>, but that requires more
changes to arch-specific code.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 505a60e225606fbd3d2eadc31ff793d939ba66f1)

Orabug: 25808647

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>

sparc64: prevent sunvdc from sending duplicate vdisk requests

Prevent sunvdc from sending duplicate vdisk requests by ensuring that
inflight vdisk requests are resent before waking up suspended vdisk
threads.

Orabug: 25866770

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

ldmvsw: stop the clean timer at beginning of remove

Stop the clean timer earlier to be sure there's no asynchronous
interference while stopping the port.

Orabug: 25748241

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: set CONFIG_EFI in config

Orabug: 26037358

Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: /sys/firmware/efi missing during EFI boot

The newest version of OBP is capable of doing an EFI boot. When Linux
is booted through this EFI loader, the /sys/firmware/efi directory does
not exist. Many userspace applications, such as GRUB, check whether
the directory /sys/firmware/efi exists; if it does, they take it to mean
that the kernel has booted in EFI mode.

A new Open Firmware property called efi-booter has been added
to /chosen. This new property is only present when doing an
EFI boot.

Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Reviewed-by: Thomas Tai <thomas.tai@oracle.com>

Orabug: 26037358
Signed-off-by: Allen Pais <allen.pais@oracle.com>

Allow default value of npools used for iommu to be configured from cmdline

    The default value of the number of pools used by the pooled IOMMU
    allocator in lib/iommu-common.c is a constant today (set to 16).
    It is possible that, for some platforms and some devices, the combination
    of latency and frequency of iommu alloc/free requests may be such
    as to trigger fragmentation within a pool, leading to iommu alloc failure.

    Reducing the number of pools (and thus increasing the pool size) can
    minimize the risk of those failures.

    This patch provides a command line hook to set the default number of
    pools at boot time.
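
A command line hook of this kind boils down to parsing and validating one number. The userspace sketch below illustrates that step only; parse_npools() and MAX_NPOOLS_SKETCH are hypothetical, and the actual parameter name and limits used by the patch are not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEFAULT_NPOOLS 16u	/* the built-in default mentioned above */
#define MAX_NPOOLS_SKETCH 16u	/* hypothetical upper bound for this sketch */

static_assert(DEFAULT_NPOOLS <= MAX_NPOOLS_SKETCH, "default must be in range");

/*
 * Parse the value given on the command line. Anything missing, malformed,
 * zero or above the bound falls back to the default.
 */
unsigned int parse_npools(const char *arg)
{
	char *end;
	unsigned long v;

	if (arg == NULL || *arg == '\0')
		return DEFAULT_NPOOLS;
	errno = 0;
	v = strtoul(arg, &end, 10);
	if (errno != 0 || *end != '\0' || v == 0 || v > MAX_NPOOLS_SKETCH)
		return DEFAULT_NPOOLS;
	return (unsigned int)v;
}
```

Falling back to the default on bad input keeps a mistyped boot parameter from leaving the allocator with an unusable pool count.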

Ported to UEK4

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

SPARC64: Add Linux vds driver Device ID support for Solaris guest boot

Currently, Solaris guest backend disk images cannot be moved from the Device ID
they were created at and still boot. This bug fix adds Solaris Device ID
support to the Linux vds driver to allow a Solaris guest backend disk image to
be moved to a different device ID from where it was created and still boot.

The Linux vds driver support added in this bug is for Solaris disk images
only. In the future, Solaris Device ID support for physical disk backends will
be added to the Linux vds driver as well.

From PSARC/1995/352:
Solaris Device IDs provide a means for identifying a device, independent of the
device's current name or device number. The instance number of a device number
may change across reconfiguration boots, changing the device number (dev_t) for
that device. Operator errors in recabling can cause devices to swap logical
device names, introducing the potential for data loss.

Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Reviewed-by: Alexandre Chartre <Alexandre.Chartre@oracle.com>
Orabug: 25836231
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Remove locking of huge pages in DAX driver

Orabug: 25968141

Some huge page virtual addresses do not work with get_user_pages. Since
the purpose of calling get_user_pages is for its locking side effect, it
is not at all necessary for huge pages since they are permanently
pinned. So the failure is avoided and the unnecessary locking/unlocking
is eliminated.

Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

ldmvsw: unregistering netdev before disable hardware

When running the LDom binding/unbinding test, the kernel may panic
in ldmvsw_open(). The likely cause is that, because we remove
the ldc connection before unregistering the netdev in vsw_port_remove(),
we open a window of time where one process could be removing the
device while another is trying to bring it up. This also sometimes causes
a vio handshake error due to opening a device without closing it completely.
We should unregister the netdev before we disable the "hardware".

orabug: 25980913, 25925306

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

arch/sparc: Measure receiver forward progress to avoid send mondo timeout

A large sun4v SPARC system may have moments of intensive xcall activities,
usually caused by unmapping many pages on many CPUs concurrently. This can
flood receivers with CPU mondo interrupts for an extended period, causing
some unlucky senders to hit send-mondo timeout. This problem gets worse
as cpu count increases because sometimes mappings must be invalidated on
all CPUs, and sometimes all CPUs may gang up on a single CPU.

But a busy system is not a broken system. In the above scenario, as long
as the receiver is making forward progress processing mondo interrupts,
the sender should continue to retry.

This patch implements the receiver's forward progress meter by introducing
a per cpu counter 'cpu_mondo_counter[cpu]' where 'cpu' is in the range
of 0..NR_CPUS-1. The receiver increments its counter as soon as it receives
a mondo and the sender tracks the receiver's counter. Every 10000 retries,
if the receiver has stopped making forward progress, the sender declares
send-mondo-timeout and panics; otherwise, the sender continues to retry
while the receiver keeps making forward progress.
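
The retry policy can be modelled in userspace as below. This is a simulation under stated assumptions, not the kernel code: receiver_sketch, try_send() and send_mondo_sketch() are hypothetical, the receiver is a plain struct instead of another CPU, and a timeout is reported with -1 where the kernel would panic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RETRIES_PER_CHECK 10000u	/* "every 10000 retries" from the text */

/* Simulated receiver: models cpu_mondo_counter[cpu] for one CPU. */
struct receiver_sketch {
	uint64_t mondo_counter;	/* incremented for every mondo received */
	uint64_t busy_attempts;	/* sends refused before ours is accepted */
	int makes_progress;	/* nonzero: serves other senders while busy */
};

/* One send attempt; returns 1 if the receiver accepted our mondo. */
static int try_send(struct receiver_sketch *r)
{
	if (r->busy_attempts == 0) {
		r->mondo_counter++;
		return 1;
	}
	r->busy_attempts--;
	if (r->makes_progress)
		r->mondo_counter++;	/* busy with someone else's mondo */
	return 0;
}

/* Returns 0 once the mondo is delivered, -1 on send-mondo timeout. */
int send_mondo_sketch(struct receiver_sketch *r)
{
	uint64_t last, now;
	unsigned int retries = 0;

	assert(r != NULL);
	last = r->mondo_counter;
	for (;;) {
		if (try_send(r))
			return 0;
		if (++retries < RETRIES_PER_CHECK)
			continue;
		retries = 0;
		now = r->mondo_counter;
		if (now == last)
			return -1;	/* no forward progress since last check */
		last = now;
	}
}
```

A receiver that refuses 25000 attempts but keeps serving other senders is retried until it accepts; one that refuses without its counter moving times out at the first check.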

Orabug: 25476541
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-By: Steve Sistare <steven.sistare@oracle.com>
Reviewed-By: Anthony Yznaga <anthony.yznaga@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: update DAX submit to latest HV spec

Orabug: 25927558

DAX submit needs to be updated to the latest HV spec. Along with a couple
of small updates, the biggest modification is changing nomap_va to
status_data. This is mostly a cosmetic change but also adds support for
returning the unavailable code via the exec ioctl. Further, augment the
comments and fix up a couple of nits in the ccb submit hcall in hypervisor.h.

Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

arch/sparc: increase CONFIG_NODES_SHIFT on SPARC to 5

SPARC M6-32 platform has (2^5) numa nodes, so we need to bump up the
CONFIG_NODES_SHIFT to 5.

Orabug: 25577754

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

arch/sparc: support NR_CPUS = 4096

Linux SPARC64 limits NR_CPUS to 4064 because init_cpu_send_mondo_info()
only allocates a single page for NR_CPUS mondo entries. Thus we cannot
use all 4096 CPUs on some SPARC platforms.

To fix, allocate (2^order) pages where order is set according to the size
of cpu_list for possible cpus. Since cpu_list_pa and cpu_mondo_block_pa
are not used in asm code, there are no imm13 offsets from the base PA
that will break because they can only reach one page.
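
The order computation can be sketched as follows. The layout used here is an assumption made for illustration: a 64-byte mondo block followed by one 16-bit entry per CPU, on 8 KB pages. Under that assumption a single page holds (8192 - 64) / 2 = 4064 entries, which would match the limit quoted above. order_for_size() and mondo_area_size() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
unsigned int order_for_size(size_t size, size_t page_size)
{
	unsigned int order = 0;

	assert(page_size != 0);
	while ((page_size << order) < size)
		order++;
	return order;
}

/* Assumed layout: 64-byte mondo block, then a 16-bit entry per CPU. */
size_t mondo_area_size(unsigned int nr_cpus)
{
	return 64 + (size_t)nr_cpus * sizeof(unsigned short);
}
```

With these assumptions, 4064 CPUs need order 0 (one page) and 4096 CPUs need order 1 (two pages).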

Orabug: 25505750

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

ipv6: catch a null skb before using it in a DTRACE

Fix a little trap set by an earlier DTRACE_IP patch. While I was there
I checked the other similar calls and the rest look okay.

Orabug: 25973797

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-By: Jane Chu <jane.chu@oracle.com>
Reviewed-By: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: fix fault handling in NGbzero.S and GENbzero.S

When any of the functions contained in NGbzero.S and GENbzero.S
are being run, we may end up taking a fault when executing one
of the store alternate address space instructions. If this
happens, the exception handler does not restore the %asi
register.

This commit fixes the issue by introducing a new exception
handler that ensures the %asi register is restored when
a fault is handled.

Orabug: 25577560

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: modify sys_dax.h for new libdax

Orabug: 25927572

Modify sys_dax.h such that new libdax can be compiled by including this
file unmodified. Userspace does not have u16, u32, etc. types defined and
as stated in Section 5e of Documentation/CodingStyle, we should be using
__u16, __u32, etc. in the ioctl structures which are exported to userspace.

Further, rename the DAXIOC_DEP_[number] ioctls and use DAXIOC_[name]_OLD
instead.

Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

bnx2x: Align RX buffers

The bnx2x driver is not providing proper alignment on the receive buffers it
passes to build_skb(), causing skb_shared_info to be misaligned.
skb_shared_info contains an atomic, and while PPC normally supports
unaligned accesses, it does not support unaligned atomics.

Aligning the size of rx buffers will ensure that page_frag_alloc() returns
aligned addresses.
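
Rounding a buffer size up to an alignment boundary is a one-line computation. The sketch below uses a hypothetical align_up() helper and a 64-byte alignment chosen only as an example; the kernel has its own alignment macros for this.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round size up to the next multiple of align.
 * align must be a power of two.
 */
size_t align_up(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}
```

For example, a 1450-byte size rounded up to a 64-byte boundary becomes 1472, so consecutive buffers carved out of the same page start at aligned addresses.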

This can be reproduced on PPC by setting the network MTU to 1450 (or other
non-multiple-of-4) and then generating sufficient inbound network traffic
(one or two large "wget"s usually does it), producing the following oops:

Unable to handle kernel paging request for unaligned access at address 0xc00000ffc43af656
Faulting instruction address: 0xc00000000080ef8c
Oops: Kernel access of bad area, sig: 7 [#1]
SMP NR_CPUS=2048
NUMA
PowerNV
Modules linked in: vmx_crypto powernv_rng rng_core powernv_op_panel leds_powernv led_class nfsd ip_tables x_tables autofs4 xfs lpfc bnx2x mdio libcrc32c crc_t10dif crct10dif_generic crct10dif_common
CPU: 104 PID: 0 Comm: swapper/104 Not tainted 4.11.0-rc8-00088-g4c761da #2
task: c00000ffd4892400 task.stack: c00000ffd4920000
NIP: c00000000080ef8c LR: c00000000080eee8 CTR: c0000000001f8320
REGS: c00000ffffc33710 TRAP: 0600 Not tainted (4.11.0-rc8-00088-g4c761da)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
CR: 24082042 XER: 00000000
CFAR: c00000000080eea0 DAR: c00000ffc43af656 DSISR: 00000000 SOFTE: 1
GPR00: c000000000907f64 c00000ffffc33990 c000000000dd3b00 c00000ffcaf22100
GPR04: c00000ffcaf22e00 0000000000000000 0000000000000000 0000000000000000
GPR08: 0000000000b80008 c00000ffc43af636 c00000ffc43af656 0000000000000000
GPR12: c0000000001f6f00 c00000000fe1a000 000000000000049f 000000000000c51f
GPR16: 00000000ffffef33 0000000000000000 0000000000008a43 0000000000000001
GPR20: c00000ffc58a90c0 0000000000000000 000000000000dd86 0000000000000000
GPR24: c000007fd0ed10c0 00000000ffffffff 0000000000000158 000000000000014a
GPR28: c00000ffc43af010 c00000ffc9144000 c00000ffcaf22e00 c00000ffcaf22100
NIP [c00000000080ef8c] __skb_clone+0xdc/0x140
LR [c00000000080eee8] __skb_clone+0x38/0x140
Call Trace:
[c00000ffffc33990] [c00000000080fb74] skb_clone+0x74/0x110 (unreliable)
[c00000ffffc339c0] [c000000000907f64] packet_rcv+0x144/0x510
[c00000ffffc33a40] [c000000000827b64] __netif_receive_skb_core+0x5b4/0xd80
[c00000ffffc33b00] [c00000000082b2bc] netif_receive_skb_internal+0x2c/0xc0
[c00000ffffc33b40] [c00000000082c49c] napi_gro_receive+0x11c/0x260
[c00000ffffc33b80] [d000000066483d68] bnx2x_poll+0xcf8/0x17b0 [bnx2x]
[c00000ffffc33d00] [c00000000082babc] net_rx_action+0x31c/0x480
[c00000ffffc33e10] [c0000000000d5a44] __do_softirq+0x164/0x3d0
[c00000ffffc33f00] [c0000000000d60a8] irq_exit+0x108/0x120
[c00000ffffc33f20] [c000000000015b98] __do_irq+0x98/0x200
[c00000ffffc33f90] [c000000000027f14] call_do_irq+0x14/0x24
[c00000ffd4923a90] [c000000000015d94] do_IRQ+0x94/0x110
[c00000ffd4923ae0] [c000000000008d90] hardware_interrupt_common+0x150/0x160

Orabug: 25806778
Cherry-picked from 05c0d69d7

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

PCI: Fix unaligned accesses in VC code

The save/restore buffer for VC state is composed of a 2-byte control
register followed by a series of 4-byte words.

This causes unaligned accesses, which trap on platforms such as sparc.

This is easy to fix by simply moving the buffer pointer forward by 4 bytes
instead of 2 after dealing with the control register. The length
adjustment needs to be changed likewise.
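
The resulting layout can be sketched in userspace as below. The names are hypothetical and the copy uses memcpy() for portability; the point is only that reserving 4 bytes for the 2-byte control register puts every following word at an offset that is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VC_CTRL_SLOT 4u	/* bytes reserved for the 2-byte control register */

/* Byte offset of the i-th 4-byte word in the save buffer. */
size_t vc_word_offset(size_t i)
{
	return VC_CTRL_SLOT + i * sizeof(uint32_t);
}

/* Store ctrl and nwords words into buf; returns -1 if buf is too small. */
int vc_save_sketch(void *buf, size_t len, uint16_t ctrl,
		   const uint32_t *words, size_t nwords)
{
	unsigned char *p = buf;
	size_t i;

	assert(buf != NULL && (words != NULL || nwords == 0));
	if (len < vc_word_offset(nwords))
		return -1;
	memcpy(p, &ctrl, sizeof(ctrl));
	for (i = 0; i < nwords; i++)
		memcpy(p + vc_word_offset(i), &words[i], sizeof(words[i]));
	return 0;
}
```

With a 2-byte slot the first word would land at offset 2, which is what made the 4-byte accesses unaligned.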

Orabug: 25806778
Cherry-picked from b77b3610 PCI: Fix unaligned accesses in VC code

Fixes: 5f8fc43217a0 ("PCI: Include pci/pcie/Kconfig directly from pci/Kconfig")
Reported-by: Meelis Roos <mroos@linux.ee>
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: stable@vger.kernel.org # v4.6+
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: Use LOCKDEP_SMALL, not PROVE_LOCKING_SMALL

Orabug: 25830041

(Cherry-pick of upstream 395102db441abb8fd18fec5dd81428b5120232af)

CONFIG_PROVE_LOCKING_SMALL shrinks the memory usage of lockdep so the
kernel text, data, and bss fit in the locked TLB entries allotted for
the kernel, but this option is not set for every config that enables
lockdep.

A 4.10 kernel fails to boot with the console output

    Kernel: Using 8 locked TLB entries for main kernel image.
    hypervisor_tlb_lock[2000000:0:8000000071c007c3:1]: errors with f
    Program terminated

with these config options

    CONFIG_LOCKDEP=y
    CONFIG_LOCK_STAT=y
    CONFIG_PROVE_LOCKING=n

To fix, rename CONFIG_PROVE_LOCKING_SMALL to CONFIG_LOCKDEP_SMALL, and
enable this option with CONFIG_LOCKDEP=y so we get the reduced memory
usage every time lockdep is turned on.

Tested that CONFIG_LOCKDEP_SMALL is set to 'y' if and only if
CONFIG_LOCKDEP is set to 'y'.  When other lockdep-related config options
that select CONFIG_LOCKDEP are enabled (e.g. CONFIG_LOCK_STAT or
CONFIG_PROVE_LOCKING), verified that CONFIG_LOCKDEP_SMALL is also
enabled.

Fixes: 64740b06b7e5 ("config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc")
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined

Reduce the size of the data structures for lockdep entries by half if
PROVE_LOCKING_SMALL is defined. This is used only for sparc.
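
As a rough sketch of the idea, a compile-time switch can halve a statically
allocated table. The macro, structure and sizes below are hypothetical
stand-ins chosen for illustration; they are not the lockdep constants
touched by this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the config symbol: define it for small tables. */
#ifdef DEMO_PROVE_LOCKING_SMALL
#define DEMO_MAX_ENTRIES 16384UL
#else
#define DEMO_MAX_ENTRIES 32768UL
#endif

/* Illustrative entry type; the real lockdep structures differ. */
struct demo_lock_entry {
	unsigned long key;
	unsigned long class_idx;
};

static_assert(DEMO_MAX_ENTRIES % 2 == 0, "table size must stay even");

/* Static storage: its size contributes directly to the kernel's .bss. */
static struct demo_lock_entry demo_entries[DEMO_MAX_ENTRIES];

static size_t demo_table_bytes(void)
{
	return sizeof(demo_entries);
}
```

Because the table is static, halving the entry count halves its share of
.bss without any runtime cost.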

Orabug: 24736954

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>

config: Adding the new config parameter CONFIG_PROVE_LOCKING_SMALL for sparc

This new config parameter reduces the space used for "Lock debugging:
prove locking correctness" by about 4MB. Current sparc systems limit
the kernel size, including the .text, .data and .bss sections, to 32MB.
With the PROVE_LOCKING feature, the kernel size could grow beyond this
limit and cause system boot-up issues. With this option, the kernel
limits the size of the entries of lock_chains, stack_trace etc. so that
it fits in the required size limit. The option is not visible to the
user and is only used for sparc.

Orabug: 24736954

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>

sparc64: fix cdev_put() use-after-free when unbinding an LDom

After turning on the slub_debug=P kernel option, a kernel panic happens
when unbinding an LDom. This suggests that there is memory corruption.
The memory corruption is caused by vlds_fops_release() freeing a
structure that contains a cdev. The cdev is still needed by
fs/file_table.c after the file is released.

The common approach to solve this issue is to add a kobject member
to the structure and set it as the parent of the cdev. The kobject is
then responsible for freeing the structure when the reference count
drops to zero. The solution is based on the following patch.

https://patchwork.kernel.org/patch/8985881/
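
The lifetime rule can be sketched in userspace with a plain refcount and a
release callback. This is a simplified model and not the vlds driver code:
demo_kobj, demo_dev and the helper functions are hypothetical stand-ins for
the kobject and cdev machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kobject: a refcount plus a release hook. */
struct demo_kobj {
	int refcount;
	void (*release)(struct demo_kobj *kobj);
};

/* Hypothetical device structure embedding the character-device state. */
struct demo_dev {
	struct demo_kobj kobj;	/* owns the lifetime of the whole structure */
	int cdev_state;		/* stand-in for the embedded cdev */
};

static int demo_released;	/* counts how many times release ran */

static void demo_dev_release(struct demo_kobj *kobj)
{
	/* kobj is the first member, so both pointers coincide. */
	struct demo_dev *dev = (struct demo_dev *)kobj;

	demo_released++;
	free(dev);
}

static struct demo_dev *demo_dev_alloc(void)
{
	struct demo_dev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->kobj.refcount = 1;
	dev->kobj.release = demo_dev_release;
	return dev;
}

static void demo_get(struct demo_kobj *kobj)
{
	kobj->refcount++;
}

/* Drop one reference; free the container only when the last one goes. */
static void demo_put(struct demo_kobj *kobj)
{
	assert(kobj->refcount > 0);
	if (--kobj->refcount == 0)
		kobj->release(kobj);
}
```

In this model the file-release path only calls demo_put(); the structure
stays valid for whoever still holds a reference and is freed by the final
put rather than by the release handler directly.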

Orabug: 25911389

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-By: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Tom Saeger <tom.saeger@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

sparc64: change DAX CCB_EXEC ENOBUFS print to debug

Orabug: 25927528

The CCB_EXEC ioctl in the DAX driver returns ENOBUFS when the user must
free completion areas before the submission can succeed. There is a
dax_err() print when this condition occurs. Because the caller can use
this return value to trigger freeing the completion areas, an error print
is too verbose, so change it to a dax_dbg() print.
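
The caller-side pattern this refers to might look like the sketch below.
It is a userspace simulation under stated assumptions: demo_ccb_exec() and
demo_free_completion_areas() are hypothetical stand-ins, not the DAX
driver's ioctl interface, and the slot count is arbitrary.

```c
#include <assert.h>
#include <errno.h>

#define DEMO_CA_SLOTS 4		/* hypothetical number of completion areas */

static int demo_ca_in_use;

/* Hypothetical stand-in for the CCB_EXEC submission path. */
static int demo_ccb_exec(void)
{
	assert(demo_ca_in_use <= DEMO_CA_SLOTS);
	if (demo_ca_in_use == DEMO_CA_SLOTS)
		return -ENOBUFS;	/* expected back-pressure, not a fault */
	demo_ca_in_use++;
	return 0;
}

/* Hypothetical stand-in for freeing finished completion areas. */
static void demo_free_completion_areas(void)
{
	demo_ca_in_use = 0;
}

/* Submit once; on ENOBUFS free completion areas and retry a single time. */
static int demo_submit(void)
{
	int ret = demo_ccb_exec();

	if (ret == -ENOBUFS) {
		demo_free_completion_areas();
		ret = demo_ccb_exec();
	}
	return ret;
}
```

Since ENOBUFS is part of the normal submit loop in this pattern, logging it
at error level would fire on every retry.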

Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>

xen-netback: copy buffer on xenvif_start_xmit

Normally packets are enqueued from ndo_start_xmit into rx_queue,
which is internal to netback. The guestrx thread then picks
these up, creates the grant copy ops (coalescing them as much as
possible) and notifies the frontend. However, most packets now end up
being memcpy-ed directly by netback (instead of through Xen). As a result
the guestrx thread ends up waiting more (and being woken up by the
transmit function), which leads to higher contention on the wait queue,
as seen with lock_stat:

class name    con-bounces    contentions   waittime-min   waittime-max
waittime-total   waittime-avg    acq-bounces   acquisitions   holdtime-min
holdtime-max holdtime-total   holdtime-avg
--------------------------------------------------------------------------
&queue->wq:   792            792           0.36          24.36
1140.30           1.44           4208        1002671           0.00
46.75      538164.02           0.54
----------
&queue->wq    326          [<ffffffff8115949f>] __wake_up+0x2f/0x80
&queue->wq    410          [<ffffffff811592bf>] finish_wait+0x4f/0xa0
&queue->wq     56          [<ffffffff811593eb>] prepare_to_wait+0x2b/0xb0
----------
&queue->wq    202          [<ffffffff811593eb>] prepare_to_wait+0x2b/0xb0
&queue->wq    467          [<ffffffff8115949f>] __wake_up+0x2f/0x80
&queue->wq    123          [<ffffffff811592bf>] finish_wait+0x4f/0xa0

with staging grants:

&queue->wq:   61834          61836           0.32          30.12
99710.27           1.61         241400        1125308           0.00
75.61     1106578.82           0.98
----------
&queue->wq     5079        [<ffffffff8115949f>] __wake_up+0x2f/0x80
&queue->wq    56280        [<ffffffff811592bf>] finish_wait+0x4f/0xa0
&queue->wq      479        [<ffffffff811593eb>] prepare_to_wait+0x2b/0xb0
----------
&queue->wq     1005        [<ffffffff811592bf>] finish_wait+0x4f/0xa0
&queue->wq    56761        [<ffffffff8115949f>] __wake_up+0x2f/0x80
&queue->wq     4072        [<ffffffff811593eb>] prepare_to_wait+0x2b/0xb0

To avoid such contention we copy the packet directly in ndo_start_xmit,
which avoids kicking this thread. However, with recycling of grants
it is *not* fully guaranteed that all copies will be done by netback, and
as a result handling all packets in start_xmit could potentially lead to
a very high number of hypercalls per packet and therefore hurt
throughput. Only with a copy frontend would we have this guarantee. Hence
for now we hide this behind a module parameter (skip_guestrx_thread),
which is false by default.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netback: slightly rework xenvif_rx_skb

This way we can reuse xenvif_rx_skb when transmitting
an skb that is not taken from the internal guestrx queue.
We therefore isolate the dequeueing in xenvif_rx_action, in guestrx
context, along with all its usage of the completed queue.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netfront: introduce rx copy mode

This allows us to not rely on recycling opportunities for cases
where page recycling isn't effective and/or workloads prove to
be faster when copying onto new pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netfront: use gref mappings for Tx buffers

This is done to allow the user to avoid grant paths
by resorting to copying instead. Since we aren't guaranteed
to reuse pages on stack transmit, we can only reliably
copy to the Tx granted pages. Hence we keep a shadow pool and
use that instead. We don't require coalescing Tx requests, as
we already linearize the skb when required slots > available
slots.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netfront: generalize recycling for grants

Take the existing mechanism for recycling pages and leverage it
for grant references too. The difference is that pages permanently
granted to the backend cannot be revoked (because they are mapped by the
other side), and hence they need to go to a separate quarantine pool
until they can be consumed. The strategy is:

1) Get a page by fetching the oldest entry in rx_pool.
2) If it's not granted, the page is freed at the head.
3) If it's reusable, return the page; otherwise add it to the
   quarantine pool.
4) Fetch the oldest entry in the quarantine pool.
5) If all else fails, resort to allocating a new page.

In the worst case, two atomic read ops are added to the packet path
when allocating a new page for Rx requests.

This page reuse strategy allows us to remove a copy for each page handed
over by the backend, raising guest RX performance to ~42-47 Gbit/s when
testing backend -> frontend. The measured recycling percentage is about
30% on TCP streams if pool size == ring size; with pool size == 2 *
ring size this rises to 80 - 100%. This suggests that bigger ring sizes
should allow for better recycling, which remains to be explored.

The only downside of this approach is that it is not 100% guaranteed that
the Rx requests provided to the backend will already be mapped; in other
words, the backend may need to do a grant copy on 1% of the packets.
This is not the case, though, in full copy mode, where we always
reuse the same grants while copying into new pages for the upper layers.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netfront: add rx page statistics

Add three new counters, namely rx_alloc_pages, rx_alloc_failed_pages
and rx_packet_pages, so that we can observe how many packets hit
the recycling path (or otherwise).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netfront: introduce rx page recycling

Recycling pages lets us avoid the page allocator when possible, an
approach similar to the one followed by the ixgbe and mlx{4,5} drivers.
Introduce a small buffer pool tracking outstanding pages. We increase the
page refcount by 1 to prevent the stack from freeing the page in upper
layers. Recycling of pages is then possible for inflight skbs once the
stack has processed N requests: when allocating new Rx requests we
attempt to reuse the oldest page in the pool if and only if
page._refcount is 1. Otherwise we just decrement the refcount (on
free_page).
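
A minimal userspace sketch of this pool logic, assuming a fixed-size FIFO
and a plain integer refcount. All demo_* names are hypothetical; the
driver's actual pool structure and page refcounting API are not shown in
this log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POOL_SIZE 4	/* hypothetical pool capacity */

struct demo_page {
	int refcount;
};

struct demo_pool {
	struct demo_page *slot[DEMO_POOL_SIZE];
	size_t head;		/* index of the oldest entry */
	size_t count;
};

static struct demo_page *demo_new_page(void)
{
	struct demo_page *page = calloc(1, sizeof(*page));

	if (page)
		page->refcount = 1;
	return page;
}

static void demo_put_page(struct demo_page *page)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0)
		free(page);
}

/* Track a page handed to the stack: the pool takes one extra reference. */
static int demo_pool_track(struct demo_pool *pool, struct demo_page *page)
{
	if (pool->count == DEMO_POOL_SIZE)
		return -1;
	page->refcount++;
	pool->slot[(pool->head + pool->count) % DEMO_POOL_SIZE] = page;
	pool->count++;
	return 0;
}

/* Get a page for a new Rx request, reusing the oldest one if possible. */
static struct demo_page *demo_pool_get(struct demo_pool *pool, int *recycled)
{
	*recycled = 0;
	if (pool->count > 0) {
		struct demo_page *oldest = pool->slot[pool->head];

		pool->head = (pool->head + 1) % DEMO_POOL_SIZE;
		pool->count--;
		if (oldest->refcount == 1) {
			*recycled = 1;	/* only the pool reference is left */
			return oldest;
		}
		demo_put_page(oldest);	/* stack still owns it: drop our ref */
	}
	return demo_new_page();
}
```

The refcount test is the whole safety argument: a count of 1 means the
stack has released its reference, so the page can be handed out again
without touching the allocator.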

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netfront: move rx_gso_checksum_fixup into netfront_stats

It allows us to remove one atomic op (in a very rare case) and
makes it easier to add new statistics.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netfront: introduce staging gref pools

Grant buffers and allow backend to permanently map these grants
through the control messages newly added. This only happens if
the backend advertises "feature-staging-grants".

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netback: use gref mappings for Tx requests

Introduce already-mapped grants (mapped by a control ring request from
the guest) for the TX path, which follows a similar code path to grant
mapping.

It starts by checking whether there's a grant available for the header
and frags grefs and, if so, setting it in tx_grants. If no gref mapping
is found in the tree for the header it will resort to grant copy. For the
frags it will perform a gref lookup on the mapping table, and if no
entry is found it falls back to grant map/unmap using mmap_pages. When
the skb destructor callback gets called we release the slot and the grant
within the callback to avoid waking up the dealloc thread. As long as
there are no unmaps to be done the dealloc thread will remain inactive.

Results show an improvement of 46% (3.6 vs 1.24 Mpps, 64 pkt size)
measured with pktgen and up to over 48% (28 vs 14.5 Gbit/s) measured
with iperf3 2 queue vif, DomU to Dom0. Measured too with sendfile()
and it goes further up to 35.3 Gbit/s given the lack of a second copy.
Tests run locally on an Intel Xeon CPU E5-2699 v3 with HT disabled,
Dom0 <-> DomU.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netback: use gref mappings for Rx requests

First look up the requested gref in the frontend gref mapping table to
see whether it is already mapped and has the right permissions.
If so, use that mapping instead.
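
The decision can be sketched as below. This is an illustration only: the
table is a flat array searched linearly for brevity, and the demo_* names
and the read-only flag are hypothetical rather than taken from the netback
sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_GREF_READONLY 0x1u	/* hypothetical flag: mapping is read-only */

struct demo_gref_map {
	uint32_t gref;		/* grant reference supplied by the frontend */
	uint32_t flags;
	void *page;		/* backing page the grant was mapped into */
};

enum demo_rx_path { DEMO_RX_MEMCPY, DEMO_RX_GRANT_COPY };

/* Find the mapping for gref, or NULL if it was never pre-mapped. */
static const struct demo_gref_map *
demo_gref_lookup(const struct demo_gref_map *tbl, size_t n, uint32_t gref)
{
	assert(tbl != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (tbl[i].gref == gref)
			return &tbl[i];
	return NULL;
}

/* Pick the Rx path: memcpy into a usable mapping, else grant copy. */
static enum demo_rx_path
demo_rx_choose(const struct demo_gref_map *tbl, size_t n, uint32_t gref)
{
	const struct demo_gref_map *m = demo_gref_lookup(tbl, n, gref);

	/* Rx writes into the guest page, so a read-only mapping is unusable. */
	if (m && !(m->flags & DEMO_GREF_READONLY))
		return DEMO_RX_MEMCPY;
	return DEMO_RX_GRANT_COPY;
}
```

Keeping grant copy as the fallback means a frontend that never pre-maps
anything still works, only without the saved hypercall.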

Results are 2.04 Mpps measured with pktgen (pkt_size 64, burst 1)
with already mapped grants versus half of that with grant copy.
Fundamentally it works in the same way as grants; it just avoids
asking Xen to copy the page, and hence opens room for other
improvements.

For example, with the mapped grefs contention on queue->wq increases,
as kthread_guest_rx goes to sleep more often. We can
alternatively copy the skb in xenvif_start_xmit() instead of going
through the RX kthread. That would only be beneficial if the guest
*only* used the mapped grants (either by copying or recycling mechanisms);
otherwise it would add the significant cost of a grant copy
hypercall per packet.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netback: shorten tx grant copy

Refactor the grant copy setup on the transmit side into a helper.
Further commits will allow this routine to memcpy from a premapped
page.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

xen-netback: introduce staging grant mappings ops

Introduce support for staging grants which means having a
set of preallocated buffers that get reused over time. This is
negotiated through a couple of xenstore entries in the form of:

* /local/domain/1/device/vif/0/queue-0 = ""
* /local/domain/1/device/vif/0/queue-0/tx-pool-ref = "<ring-ref-tx0>"
* /local/domain/1/device/vif/0/queue-0/tx-pool-size = "<nr-entries-tx0>"
* /local/domain/1/device/vif/0/queue-0/rx-pool-ref = "<ring-ref-rx0>"
* /local/domain/1/device/vif/0/queue-0/rx-pool-size = "<nr-entries-rx0>"

These entries hand over a list of `struct xen_ext_gref_alloc` which
the frontend provides (one page of XEN_PAGE_SIZE, which fits 512
entries). Each entry contains the gref and flags to map into a Domain-0
ballooned page, which gets added to a hash table of gref <-> backing
page kept per queue. The frontend can use this to pregrant certain pages
and reuse them for Rx/Tx requests.
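
The 512-entries-per-page figure implies 8-byte entries if the page is
4 KiB. The sketch below checks that arithmetic with a hypothetical layout;
the field names and widths are assumptions, since the real definition of
`struct xen_ext_gref_alloc` is not reproduced in this log.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_XEN_PAGE_SIZE 4096u	/* assumption: 4 KiB Xen page */

/* Hypothetical layout: one grant reference plus its mapping flags. */
struct demo_gref_alloc {
	uint32_t ref;		/* grant reference to map */
	uint32_t flags;		/* mapping flags, e.g. permissions */
};

static_assert(sizeof(struct demo_gref_alloc) == 8,
	      "entry must be 8 bytes for 512 entries per 4 KiB page");

/* Number of entries that fit in one page under the assumptions above. */
static unsigned int demo_entries_per_page(void)
{
	return (unsigned int)(DEMO_XEN_PAGE_SIZE /
			      sizeof(struct demo_gref_alloc));
}
```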

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942

include/xen: import vendor extension to netif.h

Describe in the protocol headers the extension we're making
with respect to staging grants. The extensions described here
are a middle ground with what is being discussed upstream,
keeping structures similar to (yet named differently from) those
to be proposed upstream. The difference from the upstream proposal
is that there the staging grants are set up through a control ring;
here we do it at xenbus feature negotiation, which is more
maintainable while we keep this code out of tree.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 26107942