www.infradead.org Git - users/jedix/linux-maple.git/log

xen/blkfront: correct setting for xen_blkif_max_ring_order

According to this piece code:
"
pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
xen_blkif_max_ring_order, XENBUS_MAX_RING_GRANT_ORDER);
"
if xen_blkif_max_ring_order is bigger that XENBUS_MAX_RING_GRANT_ORDER,
need to set xen_blkif_max_ring_order using XENBUS_MAX_RING_GRANT_ORDER,
but not 0.

Signed-off-by: Peng Fan <van.freenix@gmail.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 45fc82642e54018740a25444d1165901501b601b)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: make persistent grants pool per-queue

Make persistent grants per-queue/ring instead of per-device, so that we can
drop the 'dev_lock' and get better scalability.

Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Queues: 1 4 8 16
Iops orig(k): 810 1064 780 700
Iops patched(k): 810 1230(~20%) 1024(~20%) 850(~20%)

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 73716df7da4f60dd2d59a9302227d0394f1b8fcc)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: Remove duplicate setting of ->xbdev.

We do the same exact operations a bit earlier in the
function.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 75f070b3967b0c3bf0e1bc43411b06bab6c2c2cd)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 6f03a7ff89485f0a7a559bf5c7631d2986c4ecfa)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: negotiate number of queues/rings to be used with backend

The max number of hardware queues for xen/blkfront is set by parameter
'max_queues'(default 4), while it is also capped by the max value that the
xen/blkback exposes through XenStore key 'multi-queue-max-queues'.

The negotiated number is the smaller one and would be written back to xenstore
as "multi-queue-num-queues", blkback needs to read this negotiated number.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 28d949bcc28bbc2d206f9c3f69b892575e81c040)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: split per device io_lock

After patch "xen/blkfront: separate per ring information out of device
info", per-ring data is protected by a per-device lock ('io_lock').

This is not a good way and will effect the scalability, so introduce a
per-ring lock ('ring_lock').

The old 'io_lock' is renamed to 'dev_lock' which protects the ->grants list and
->persistent_gnts_c which are shared by all rings.

Note that in 'blkfront_probe' the 'blkfront_info' is setup via kzalloc
so setting ->persistent_gnts_c to zero is not needed.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 11659569f7202d0cb6553e81f9b8aa04dfeb94ce)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: pseudo support for multi hardware queues/rings

Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in
patch "xen/blkfront: negotiate number of queues/rings to be used with backend"
so as to make review easier.

Note that blkfront_gather_backend_features does not call
blkfront_setup_indirect anymore (as that needs to be done per ring).
That means that in blkif_recover/blkif_connect we have to do it in a loop
(bounded by nr_rings).

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 3df0e5059908b8fdba351c4b5dd77caadd95a949)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: separate per ring information out of device info

Split per ring information to a new structure "blkfront_ring_info".

A ring is the representation of a hardware queue, every vbd device can associate
with one or more rings depending on how many hardware queues/rings to be used.

This patch is a preparation for supporting real multi hardware queues/rings.

We also add a backpointer to 'struct blkfront_info' (dev_info) which
is not needed (we could use containers_of) but further patch
("xen/blkfront: pseudo support for multi hardware queues/rings")
will make allocation of 'blkfront_ring_info' dynamic.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 81f351615772365d46ceeac3e50c9dd4e8f9dc89)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Conflicts:
drivers/block/xen-blkfront.c

xen/blkif: document blkif multi-queue/ring extension

Document the multi-queue/ring feature in terms of XenStore keys to be written by
the backend and by the frontend.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit eb5df87fab0ae7114b83dc7f338b27d039374767)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

x86/xen: don't reset vcpu_info on a cancelled suspend

On a cancelled suspend the vcpu_info location does not change (it's
still in the per-cpu area registered by xen_vcpu_setup()). So do not
call xen_hvm_init_shared_info() which would make the kernel think its
back in the shared info. With the wrong vcpu_info, events cannot be
received and the domain will hang after a cancelled suspend.

Signed-off-by: Charles Ouyang <ouyangzhaowei@huawei.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 6a1f513776b78c994045287073e55bae44ed9f8c)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Conflicts:
arch/x86/xen/suspend.c

xen/gntdev: constify mmu_notifier_ops structures

This mmu_notifier_ops structure is never modified, so declare it as
const, like the other mmu_notifier_ops structures.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit b9c0a92a9aa953e5a98f2af2098c747d4358c7bb)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/grant-table: constify gnttab_ops structure

The gnttab_ops structure is never modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 86fc2136736d2767bf797e6d2b1f80b49f52953c)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/time: use READ_ONCE

Use READ_ONCE through the code, rather than explicit barriers.

Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 2dd887e32175b624375570a0361083eb2cd64a07)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/x86: convert remaining timespec to timespec64 in xen_pvclock_gtod_notify

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit 187b26a97244b1083d573175650f41b2267ac635)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/x86: support XENPF_settime64

Try XENPF_settime64 first, if it is not available fall back to
XENPF_settime32.

No need to call __current_kernel_time() when all the info needed are
already passed via the struct timekeeper * argument.

Return NOTIFY_BAD in case of errors.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit 760968631323f710ea0824369bbd65f812c82f08)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen: introduce XENPF_settime64

Rename the current XENPF_settime hypercall and related struct to
XENPF_settime32.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit f3d6027ee0568b5442077120beeb5d9d17c2d0da)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen: rename dom0_op to platform_op

The dom0_op hypercall has been renamed to platform_op since Xen 3.2,
which is ancient, and modern upstream Linux kernels cannot run as dom0
and it anymore anyway.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit cfafae940381207d48b11a73a211142dba5947d3)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen: move xen_setup_runstate_info and get_runstate_snapshot to drivers/xen/time.c

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4ccefbe597392d2914cf7ad904e33c734972681d)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Conflicts:
arch/x86/xen/time.c

x86/paravirt: Remove paravirt ops pmd_update[_defer] and pte_update_defer

pte_update_defer can be removed as it is always set to the same
function as pte_update. So any usage of pte_update_defer() can be
replaced by pte_update().

pmd_update and pmd_update_defer are always set to paravirt_nop, so they
can just be nuked.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: jeremy@goop.org
Cc: chrisw@sous-sol.org
Cc: akataria@vmware.com
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xen.org
Cc: konrad.wilk@oracle.com
Cc: david.vrabel@citrix.com
Cc: boris.ostrovsky@oracle.com
Link: http://lkml.kernel.org/r/1447771879-1806-1-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit d6ccc3ec95251d8d3276f2900b59cbc468dd74f4)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

x86/paravirt: Remove unused pv_apic_ops structure

The only member of that structure is startup_ipi_hook which is always
set to paravirt_nop.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: jeremy@goop.org
Cc: chrisw@sous-sol.org
Cc: akataria@vmware.com
Cc: rusty@rustcorp.com.au
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xen.org
Cc: konrad.wilk@oracle.com
Cc: boris.ostrovsky@oracle.com
Link: http://lkml.kernel.org/r/1447767872-16730-1-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 460958659270b7d750d4ccfe052171cb6f655cbb)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen-blkback: read from indirect descriptors only once

Since indirect descriptors are in memory shared with the frontend, the
frontend could alter the first_sect and last_sect values after they have
been validated but before they are recorded in the request. This may
result in I/O requests that overflow the foreign page, possibly
overwriting local pages when the I/O request is executed.

When parsing indirect descriptors, only read first_sect and last_sect
once.

This is part of XSA155.

(cherry-pick from 18779149101c0dd43ded43669ae2a92d21b6f9cb)
CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
----
v2: This is against v4.3

xen/pcifront: Report the errors better.

The messages should be different depending on the type of error.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 2cfec6a2f989d5c921ba11a329ff8ea986702b9b)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23017418 - Backport Linux v4.4 Xen patches

xen/pcifront: Fix mysterious crashes when NUMA locality information was extracted.

Occasionaly PV guests would crash with:

pciback 0000:00:00.1: Xen PCI mapped GSI0 to IRQ16
BUG: unable to handle kernel paging request at 0000000d1a8c0be0
.. snip..
  <ffffffff8139ce1b>] find_next_bit+0xb/0x10
  [<ffffffff81387f22>] cpumask_next_and+0x22/0x40
  [<ffffffff813c1ef8>] pci_device_probe+0xb8/0x120
  [<ffffffff81529097>] ? driver_sysfs_add+0x77/0xa0
  [<ffffffff815293e4>] driver_probe_device+0x1a4/0x2d0
  [<ffffffff813c1ddd>] ? pci_match_device+0xdd/0x110
  [<ffffffff81529657>] __device_attach_driver+0xa7/0xb0
  [<ffffffff815295b0>] ? __driver_attach+0xa0/0xa0
  [<ffffffff81527622>] bus_for_each_drv+0x62/0x90
  [<ffffffff8152978d>] __device_attach+0xbd/0x110
  [<ffffffff815297fb>] device_attach+0xb/0x10
  [<ffffffff813b75ac>] pci_bus_add_device+0x3c/0x70
  [<ffffffff813b7618>] pci_bus_add_devices+0x38/0x80
  [<ffffffff813dc34e>] pcifront_scan_root+0x13e/0x1a0
  [<ffffffff817a0692>] pcifront_backend_changed+0x262/0x60b
  [<ffffffff814644c6>] ? xenbus_gather+0xd6/0x160
  [<ffffffff8120900f>] ? put_object+0x2f/0x50
  [<ffffffff81465c1d>] xenbus_otherend_changed+0x9d/0xa0
  [<ffffffff814678ee>] backend_changed+0xe/0x10
  [<ffffffff81463a28>] xenwatch_thread+0xc8/0x190
  [<ffffffff810f22f0>] ? woken_wake_function+0x10/0x10

which was the result of two things:

When we call pci_scan_root_bus we would pass in 'sd' (sysdata)
pointer which was an 'pcifront_sd' structure. However in the
pci_device_add it expects that the 'sd' is 'struct sysdata' and
sets the dev->node to what is in sd->node (offset 4):

set_dev_node(&dev->dev, pcibus_to_node(bus));

__pcibus_to_node(const struct pci_bus *bus)
{
        const struct pci_sysdata *sd = bus->sysdata;

        return sd->node;
}

However our structure was pcifront_sd which had nothing at that
offset:

struct pcifront_sd {
        int                        domain;    /*     0     4 */
        /* XXX 4 bytes hole, try to pack */
        struct pcifront_device *   pdev;      /*     8     8 */
}

That is an hole - filled with garbage as we used kmalloc instead of
kzalloc (the second problem).

This patch fixes the issue by:
1) Use kzalloc to initialize to a well known state.
2) Put 'struct pci_sysdata' at the start of 'pcifront_sd'. That
    way access to the 'node' will access the right offset.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 4d8c8bd6f2062c9988817183a91fe2e623c8aa5e)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug:  23017418 - Backport Linux v4.4 Xen patches

xen/pciback: Save the number of MSI-X entries to be copied later.

Commit 8135cf8b092723dbfcc611fe6fdcb3a36c9951c5 (xen/pciback: Save
xen_pci_op commands before processing it) broke enabling MSI-X because
it would never copy the resulting vectors into the response. The
number of vectors requested was being overwritten by the return value
(typically zero for success).

Save the number of vectors before processing the op, so the correct
number of vectors are copied afterwards.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit d159457b84395927b5a52adb72f748dd089ad5e5)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23017418 - Backport Linux v4.4 Xen patches

xen/pciback: Check PF instead of VF for PCI_COMMAND_MEMORY

Commit 408fb0e5aa7fda0059db282ff58c3b2a4278baa0 (xen/pciback: Don't
allow MSI-X ops if PCI_COMMAND_MEMORY is not set) prevented enabling
MSI-X on passed-through virtual functions, because it checked the VF
for PCI_COMMAND_MEMORY but this is not a valid bit for VFs.

Instead, check the physical function for PCI_COMMAND_MEMORY.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 8d47065f7d1980dde52abb874b301054f3013602)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23017418 - Backport Linux v4.4 Xen patches

Merge branch 'linux-4.1/4.4-xen-backport' of git://ca-git.us.oracle.com/linux-joaomart-public into uek4/4.4-xen-backport

* 'linux-4.1/4.4-xen-backport' of git://ca-git.us.oracle.com/linux-joaomart-public: (113 commits)
  arch/x86/xen/suspend.c: include xen/xen.h
  x86/paravirt: Prevent rtc_cmos platform device init on PV guests
  xen-pciback: fix up cleanup path when alloc fails
  xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
  xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
  xen/pciback: Do not install an IRQ handler for MSI interrupts.
  xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
  xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
  xen/pciback: Save xen_pci_op commands before processing it
  xen-scsiback: safely copy requests
  xen-blkback: read from indirect descriptors only once
  xen-blkback: only read request operation from shared ring once
  xen-netback: use RING_COPY_REQUEST() throughout
  xen-netback: don't use last request to determine minimum Tx credit
  xen: Add RING_COPY_REQUEST()
  xen/x86/pvh: Use HVM's flush_tlb_others op
  xen: Resume PMU from non-atomic context
  xen/events/fifo: Consume unprocessed events when a CPU dies
  xen/evtchn: dynamically grow pending event channel ring
  xen/gntdev: Grant maps should not be subject to NUMA balancing
  ...

Backport from Linux v4.4

OraBug: 23017418 - Backport Linux v4.4 Xen patches

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

arch/x86/xen/suspend.c: include xen/xen.h

Fix the build warning:

  arch/x86/xen/suspend.c: In function 'xen_arch_pre_suspend':
  arch/x86/xen/suspend.c:70:9: error: implicit declaration of function 'xen_pv_domain' [-Werror=implicit-function-declaration]
          if (xen_pv_domain())
              ^

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit facca61683f937f31f90307cc64851436c8a3e21)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

x86/paravirt: Prevent rtc_cmos platform device init on PV guests

Adding the rtc platform device in non-privileged Xen PV guests causes
an IRQ conflict because these guests do not have legacy PIC and may
allocate irqs in the legacy range.

In a single VCPU Xen PV guest we should have:

/proc/interrupts:
           CPU0
  0:       4934  xen-percpu-virq      timer0
  1:          0  xen-percpu-ipi       spinlock0
  2:          0  xen-percpu-ipi       resched0
  3:          0  xen-percpu-ipi       callfunc0
  4:          0  xen-percpu-virq      debug0
  5:          0  xen-percpu-ipi       callfuncsingle0
  6:          0  xen-percpu-ipi       irqwork0
  7:        321   xen-dyn-event     xenbus
  8:         90   xen-dyn-event     hvc_console
  ...

But hvc_console cannot get its interrupt because it is already in use
by rtc0 and the console does not work.

  genirq: Flags mismatch irq 8. 00000000 (hvc_console) vs. 00000000 (rtc0)

We can avoid this problem by realizing that unprivileged PV guests (both
Xen and lguests) are not supposed to have rtc_cmos device and so
adding it is not necessary.

Privileged guests (i.e. Xen's dom0) do use it but they should not have
irq conflicts since they allocate irqs above legacy range (above
gsi_top, in fact).

Instead of explicitly testing whether the guest is privileged we can
extend pv_info structure to include information about guest's RTC
support.

Reported-and-tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: vkuznets@redhat.com
Cc: xen-devel@lists.xenproject.org
Cc: konrad.wilk@oracle.com
Cc: stable@vger.kernel.org # 4.2+
Link: http://lkml.kernel.org/r/1449842873-2613-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit d8c98a1d1488747625ad6044d423406e17e99b7a)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-pciback: fix up cleanup path when alloc fails

When allocating a pciback device fails, clear the private
field. This could lead to an use-after free, however
the 'really_probe' takes care of setting
dev_set_drvdata(dev, NULL) in its failure path (which we would
exercise if the ->probe function failed), so we we
are OK. However lets be defensive as the code can change.

Going forward we should clean up the pci_set_drvdata(dev, NULL)
in the various code-base. That will be for another day.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Jonathan Creekmore <jonathan.creekmore@gmail.com>
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 584a561a6fee0d258f9ca644f58b73d9a41b8a46)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.

commit f598282f51 ("PCI: Fix the NIU MSI-X problem in a better way")
teaches us that dealing with MSI-X can be troublesome.

Further checks in the MSI-X architecture shows that if the
PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we
may not be able to access the BAR (since they are memory regions).

Since the MSI-X tables are located in there.. that can lead
to us causing PCIe errors. Inhibit us performing any
operation on the MSI-X unless the MEMORY bit is set.

Note that Xen hypervisor with:
"x86/MSI-X: access MSI-X table only after having enabled MSI-X"
will return:
xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3!

When the generic MSI code tries to setup the PIRQ without
MEMORY bit set. Which means with later versions of Xen
(4.6) this patch is not neccessary.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 408fb0e5aa7fda0059db282ff58c3b2a4278baa0)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.

Otherwise just continue on, returning the same values as
previously (return of 0, and op->result has the PIRQ value).

This does not change the behavior of XEN_PCI_OP_disable_msi[|x].

The pci_disable_msi or pci_disable_msix have the checks for
msi_enabled or msix_enabled so they will error out immediately.

However the guest can still call these operations and cause
us to disable the 'ack_intr'. That means the backend IRQ handler
for the legacy interrupt will not respond to interrupts anymore.

This will lead to (if the device is causing an interrupt storm)
for the Linux generic code to disable the interrupt line.

Naturally this will only happen if the device in question
is plugged in on the motherboard on shared level interrupt GSI.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 7cfb905b9638982862f0331b36ccaaca5d383b49)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Do not install an IRQ handler for MSI interrupts.

Otherwise an guest can subvert the generic MSI code to trigger
an BUG_ON condition during MSI interrupt freeing:

for (i = 0; i < entry->nvec_used; i++)
BUG_ON(irq_has_action(entry->irq + i));

Xen PCI backed installs an IRQ handler (request_irq) for
the dev->irq whenever the guest writes PCI_COMMAND_MEMORY
(or PCI_COMMAND_IO) to the PCI_COMMAND register. This is
done in case the device has legacy interrupts the GSI line
is shared by the backend devices.

To subvert the backend the guest needs to make the backend
to change the dev->irq from the GSI to the MSI interrupt line,
make the backend allocate an interrupt handler, and then command
the backend to free the MSI interrupt and hit the BUG_ON.

Since the backend only calls 'request_irq' when the guest
writes to the PCI_COMMAND register the guest needs to call
XEN_PCI_OP_enable_msi before any other operation. This will
cause the generic MSI code to setup an MSI entry and
populate dev->irq with the new PIRQ value.

Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY
and cause the backend to setup an IRQ handler for dev->irq
(which instead of the GSI value has the MSI pirq). See
'xen_pcibk_control_isr'.

Then the guest disables the MSI: XEN_PCI_OP_disable_msi
which ends up triggering the BUG_ON condition in 'free_msi_irqs'
as there is an IRQ handler for the entry->irq (dev->irq).

Note that this cannot be done using MSI-X as the generic
code does not over-write dev->irq with the MSI-X PIRQ values.

The patch inhibits setting up the IRQ handler if MSI or
MSI-X (for symmetry reasons) code had been called successfully.

P.S.
Xen PCIBack when it sets up the device for the guest consumption
ends up writting 0 to the PCI_COMMAND (see xen_pcibk_reset_device).
XSA-120 addendum patch removed that - however when upstreaming said
addendum we found that it caused issues with qemu upstream. That
has now been fixed in qemu upstream.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit a396f3a210c3a61e94d6b87ec05a75d0be2a60d0)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled

The guest sequence of:

a) XEN_PCI_OP_enable_msix
b) XEN_PCI_OP_enable_msix

results in hitting an NULL pointer due to using freed pointers.

The device passed in the guest MUST have MSI-X capability.

The a) constructs and SysFS representation of MSI and MSI groups.
The b) adds a second set of them but adding in to SysFS fails (duplicate entry).
'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that
in a) pdev->msi_irq_groups is still set) and also free's ALL of the
MSI-X entries of the device (the ones allocated in step a) and b)).

The unwind code: 'free_msi_irqs' deletes all the entries and tries to
delete the pdev->msi_irq_groups (which hasn't been set to NULL).
However the pointers in the SysFS are already freed and we hit an
NULL pointer further on when 'strlen' is attempted on a freed pointer.

The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard
against that. The check for msi_enabled is not stricly neccessary.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 5e0ce1455c09dd61d029b8ad45d82e1ac0b6c4c9)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled

The guest sequence of:

a) XEN_PCI_OP_enable_msi
b) XEN_PCI_OP_enable_msi
c) XEN_PCI_OP_disable_msi

results in hitting an BUG_ON condition in the msi.c code.

The MSI code uses an dev->msi_list to which it adds MSI entries.
Under the above conditions an BUG_ON() can be hit. The device
passed in the guest MUST have MSI capability.

The a) adds the entry to the dev->msi_list and sets msi_enabled.
The b) adds a second entry but adding in to SysFS fails (duplicate entry)
and deletes all of the entries from msi_list and returns (with msi_enabled
is still set). c) pci_disable_msi passes the msi_enabled checks and hits:

BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));

and blows up.

The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard
against that. The check for msix_enabled is not stricly neccessary.

This is part of XSA-157.

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 56441f3c8e5bd45aab10dd9f8c505dd4bec03b0d)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Save xen_pci_op commands before processing it

Double fetch vulnerabilities that happen when a variable is
fetched twice from shared memory but a security check is only
performed the first time.

The xen_pcibk_do_op function performs a switch statements on the op->cmd
value which is stored in shared memory. Interestingly this can result
in a double fetch vulnerability depending on the performed compiler
optimization.

This patch fixes it by saving the xen_pci_op command before
processing it. We also use 'barrier' to make sure that the
compiler does not perform any optimization.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 8135cf8b092723dbfcc611fe6fdcb3a36c9951c5)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-scsiback: safely copy requests

The copy of the ring request was lacking a following barrier(),
potentially allowing the compiler to optimize the copy away.

Use RING_COPY_REQUEST() to ensure the request is copied to local
memory.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit be69746ec12f35b484707da505c6c76ff06f97dc)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-blkback: read from indirect descriptors only once

Since indirect descriptors are in memory shared with the frontend, the
frontend could alter the first_sect and last_sect values after they have
been validated but before they are recorded in the request. This may
result in I/O requests that overflow the foreign page, possibly
overwriting local pages when the I/O request is executed.

When parsing indirect descriptors, only read first_sect and last_sect
once.

This is part of XSA155.

CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 18779149101c0dd43ded43669ae2a92d21b6f9cb)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-blkback: only read request operation from shared ring once

A compiler may load a switch statement value multiple times, which could
be bad when the value is in memory shared with the frontend.

When converting a non-native request to a native one, ensure that
src->operation is only loaded once by using READ_ONCE().

This is part of XSA155.

CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 1f13d75ccb806260079e0679d55d9253e370ec8a)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-netback: use RING_COPY_REQUEST() throughout

Instead of open-coding memcpy()s and directly accessing Tx and Rx
requests, use the new RING_COPY_REQUEST() that ensures the local copy
is correct.

This is more than is strictly necessary for guest Rx requests since
only the id and gref fields are used and it is harmless if the
frontend modifies these.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 68a33bfd8403e4e22847165d149823a2e0e67c9c)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-netback: don't use last request to determine minimum Tx credit

The last from guest transmitted request gives no indication about the
minimum amount of credit that the guest might need to send a packet
since the last packet might have been a small one.

Instead allow for the worst case 128 KiB packet.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 0f589967a73f1f30ab4ac4dd9ce0bb399b4d6357)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen: Add RING_COPY_REQUEST()

Using RING_GET_REQUEST() on a shared ring is easy to use incorrectly
(i.e., by not considering that the other end may alter the data in the
shared ring while it is being inspected). Safe usage of a request
generally requires taking a local copy.

Provide a RING_COPY_REQUEST() macro to use instead of
RING_GET_REQUEST() and an open-coded memcpy(). This takes care of
ensuring that the copy is done correctly regardless of any possible
compiler optimizations.

Use a volatile source to prevent the compiler from reordering or
omitting the copy.

This is part of XSA155.

CC: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 454d5d882c7e412b840e3c99010fe81a9862f6fb)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/x86/pvh: Use HVM's flush_tlb_others op

Using MMUEXT_TLB_FLUSH_MULTI doesn't buy us much since the hypervisor
will likely perform same IPIs as would have the guest.

More importantly, using MMUEXT_INVLPG_MULTI may not to invalidate the
guest's address on remote CPU (when, for example, VCPU from another guest
is running there).

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 20f36e0380a7e871a711d5e4e59d04d4948326b4)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen: Resume PMU from non-atomic context

Resuming PMU currently triggers a warning from ___might_sleep() (assuming
CONFIG_DEBUG_ATOMIC_SLEEP is set) when xen_pmu_init() allocates GFP_KERNEL
page because we are in state resembling atomic context.

Move resuming PMU to xen_arch_resume() which is called in regular context.
For symmetry move suspending PMU to xen_arch_suspend() as well.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org> # 4.3
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit de0afc9bdeeadaa998797d2333c754bf9f4d5dcf)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/events/fifo: Consume unprocessed events when a CPU dies

When a CPU is offlined, there may be unprocessed events on a port for
that CPU. If the port is subsequently reused on a different CPU, it
could be in an unexpected state with the link bit set, resulting in
interrupts being missed. Fix this by consuming any unprocessed events
for a particular CPU when that CPU dies.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Cc: <stable@vger.kernel.org> # 3.14+
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 3de88d622fd68bd4dbee0f80168218b23f798fd0)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/evtchn: dynamically grow pending event channel ring

If more than 1024 event channels are bound to a evtchn device then it
possible (even with well behaved applications) for the ring to
overflow and events to be lost (reported as an -EFBIG error).

Dynamically increase the size of the ring so there is always enough
space for all bound events.  Well behaved applicables that only unmask
events after draining them from the ring can thus no longer lose
events.

However, an application could unmask an event before draining it,
allowing multiple entries per port to accumulate in the ring, and a
overflow could still occur.  So the overflow detection and reporting
is retained.

The ring size is initially only 64 entries so the common use case of
an application only binding a few events will use less memory than
before.  The ring size may grow to 512 KiB (enough for all 2^17
possible channels).  This order 7 kmalloc() may fail due to memory
fragmentation, so we fall back to trying vmalloc().

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 8620015499101090ae275bf11e9bc2f9febfdf08)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/gntdev: Grant maps should not be subject to NUMA balancing

Doing so will cause the grant to be unmapped and then, during
fault handling, the fault to be mistakenly treated as NUMA hint
fault.

In addition, even if those maps could partcipate in NUMA
balancing, it wouldn't provide any benefit since we are unable
to determine physical page's node (even if/when VNUMA is
implemented).

Marking grant maps' VMAs as VM_IO will exclude them from being
part of NUMA balancing.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 9c17d96500f78d7ecdb71ca6942830158bc75a2b)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen: fix the check of e_pfn in xen_find_pfn_range

On some NUMA system, after dom0 up, we see below warning even if there are
enough pfn ranges that could be used for remapping:
"Unable to find available pfn range, not remapping identity pages"

Fix it to avoid getting a memory region of zero size in xen_find_pfn_range.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit abed7d0710e8f892c267932a9492ccf447674fb8)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

x86/xen: add reschedule point when mapping foreign GFNs

Mapping a large range of foreign GFNs can take a long time, add a
reschedule point after each batch of 16 GFNs.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit 914beb9fc26d6225295b8315ab54026f8f22755c)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/arm: don't try to re-register vcpu_info on cpu_hotplug.

Call disable_percpu_irq on CPU_DYING and enable_percpu_irq when the cpu
is coming up.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Julien Grall <julien.grall@citrix.com>
(cherry picked from commit cb9644bf3b549d20656cca02e8a6332c67cf37d6)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen, cpu_hotplug: call device_offline instead of cpu_down

When offlining a cpu, instead of cpu_down, call device_offline, which
also takes care of updating the cpu.dev.offline field. This keeps the
sysfs file /sys/devices/system/cpu/cpuN/online, up to date. Also move
the call to disable_hotplug_cpu, because it makes more sense to have it
there.

We don't call device_online at cpu-hotplug time, because that would
immediately take the cpu online, while we want to retain the current
behaviour: the user needs to explicitly enable the cpu after it has
been hotplugged.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
CC: konrad.wilk@oracle.com
CC: boris.ostrovsky@oracle.com
CC: david.vrabel@citrix.com
(cherry picked from commit 1c7a62137bb23bc8a2c05d1dad6105afa569b20e)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/arm: Enable cpu_hotplug.c

Build cpu_hotplug for ARM and ARM64 guests.

Rename arch_(un)register_cpu to xen_(un)register_cpu and provide an
empty implementation on ARM and ARM64. On x86 just call
arch_(un)register_cpu as we are already doing.

Initialize cpu_hotplug on ARM.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit a314e3eb845389b8f68130c79a63832229dea87b)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xenbus: Support multiple grants ring with 64KB

The PV ring may use multiple grants and expect them to be mapped
contiguously in the virtual memory.

Although, the current code is relying on a Linux page will be mapped to
a single grant. On build where Linux is using a different page size than
the grant (i.e other than 4KB), the grant will always be mapped on the
first 4KB of each Linux page which make the final ring not contiguous in
the memory.

This can be fixed by mapping multiple grant in a same Linux page.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 89bf4b4e4a8d9ab219cd03aada24e782cf0ac359)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/grant-table: Add an helper to iterate over a specific number of grants

With the 64KB page granularity support on ARM64, a Linux page may be
split accross multiple grant.

Currently we have the helper gnttab_foreach_grant_in_grant to break a
Linux page based on an offset and a len, but it doesn't fit when we only
have a number of grants in hand.

Introduce a new helper which take an array of Linux page and a number of
grant and will figure out the address of each grant.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit f73314b28148f9ee9f89a0ae961c8fb36e3269fa)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/xenbus: Rename *RING_PAGE* to *RING_GRANT*

Linux may use a different page size than the size of grant. So make
clear that the order is actually in number of grant.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 9cce2914e2b21339dca12c91dc9f35790366cc4c)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/arm: correct comment in enlighten.c

Correct a comment in arch/arm/xen/enlighten.c referencing a wrong
source file.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit d5f985c834fbfdedcb629b009ba272ad50add5ac)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/gntdev: use types from linux/types.h in userspace headers

__u32, __u64 etc. are preferred for userspace API headers.

Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit a36012be64e65760d208c23ea68dc12a895001d8)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/gntalloc: use types from linux/types.h in userspace headers

__u32, __u64 etc. are preferred for userspace API headers.

Signed-off-by: Mikko Rapeli <mikko.rapeli@iki.fi>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 91afb7c373e881d5038a78e1206a0f6469440ec3)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/balloon: Use the correct sizeof when declaring frame_list

The type of the item in frame_list is xen_pfn_t which is not an unsigned
long on ARM but an uint64_t.

With the current computation, the size of frame_list will be 2 *
PAGE_SIZE rather than PAGE_SIZE.

I bet it's just mistake when the type has been switched from "unsigned
long" to "xen_pfn_t" in commit 965c0aaafe3e75d4e65cd4ec862915869bde3abd
"xen: balloon: use correct type for frame_list".

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 3990dd27034606312429a09c807ea74a6ec32dde)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/swiotlb: Add support for 64KB page granularity

Swiotlb is used on ARM64 to support DMA on platform where devices are
not protected by an SMMU. Furthermore it's only enabled for DOM0.

While Xen is always using 4KB page granularity in the stage-2 page table,
Linux ARM64 may either use 4KB or 64KB. This means that a Linux page
can be spanned accross multiple Xen page.

The Swiotlb code has to validate that the buffer used for DMA is
physically contiguous in the memory. As a Linux page can't be shared
between local memory and foreign page by design (the balloon code always
removing entirely a Linux page), the changes in the code are very
minimal because we only need to check the first Xen PFN.

Note that it may be possible to optimize the function
check_page_physically_contiguous to avoid looping over every Xen PFN
for local memory. Although I will let this optimization for a follow-up.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 9435cce87950d805e6c8315410f2cb8ff6b2c6a2)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/swiotlb: Pass addresses rather than frame numbers to xen_arch_need_swiotlb

With 64KB page granularity support, the frame number will be different.

It will be easier to modify the behavior in a single place rather than
in each caller.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 291be10fd7511101d44cf98166d049bd31bc7600)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

arm/xen: Add support for 64KB page granularity

The hypercall interface is always using 4KB page granularity. This is
requiring to use xen page definition macro when we deal with hypercall.

Note that pfn_to_gfn is working with a Xen pfn (i.e 4KB). We may want to
rename pfn_gfn to make this explicit.

We also allocate a 64KB page for the shared page even though only the
first 4KB is used. I don't think this is really important for now as it
helps to have the pointer 4KB aligned (XENMEM_add_to_physmap is taking a
Xen PFN).

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 250c9af3d831139317009eaebbe82e20d23a581f)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/privcmd: Add support for Linux 64KB page granularity

The hypercall interface (as well as the toolstack) is always using 4KB
page granularity. When the toolstack is asking for mapping a series of
guest PFN in a batch, it expects to have the page map contiguously in
its virtual memory.

When Linux is using 64KB page granularity, the privcmd driver will have
to map multiple Xen PFN in a single Linux page.

Note that this solution works on page granularity which is a multiple of
4KB.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 5995a68a6272e4e8f4fe4de82cdc877e650fe8be)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

net/xen-netback: Make it running on 64KB page granularity

The PV network protocol is using 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity working as a
network backend on a non-modified Xen.

It's only necessary to adapt the ring size and break skb data in small
chunk of 4KB. The rest of the code is relying on the grant table code.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit d0089e8a0e4c9723d85b01713671358e3d6960df)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

net/xen-netfront: Make it running on 64KB page granularity

The PV network protocol is using 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity using network
device on a non-modified Xen.

It's only necessary to adapt the ring size and break skb data in small
chunk of 4KB. The rest of the code is relying on the grant table code.

Note that we allocate a Linux page for each rx skb but only the first
4KB is used. We may improve the memory usage by extending the size of
the rx skb.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 30c5d7f0da82f55c86c0a09bf21c0623474bb17f)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

block/xen-blkback: Make it running on 64KB page granularity

The PV block protocol is using 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity behaving as a
block backend on a non-modified Xen.

It's only necessary to adapt the ring size and the number of request per
indirect frames. The rest of the code is relying on the grant table
code.

Note that the grant table code is allocating a Linux page per grant
which will result to waste 6OKB for every grant when Linux is using 64KB
page granularity. This could be improved by sharing the page between
multiple grants.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: "Roger Pau Monné" <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 67de5dfbc176ea86ab0278658b5d55f64207ff2d)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

block/xen-blkfront: Make it running on 64KB page granularity

The PV block protocol is using 4KB page granularity. The goal of this
patch is to allow a Linux using 64KB page granularity using block
device on a non-modified Xen.

The block API is using segment which should at least be the size of a
Linux page. Therefore, the driver will have to break the page in chunk
of 4K before giving the page to the backend.

When breaking a 64KB segment in 4KB chunks, it is possible that some
chunks are empty. As the PV protocol always require to have data in the
chunk, we have to count the number of Xen page which will be in use and
avoid sending empty chunks.

Note that, a pre-defined number of grants are reserved before preparing
the request. This pre-defined number is based on the number and the
maximum size of the segments. If each segment contains a very small
amount of data, the driver may reserve too many grants (16 grants is
reserved per segment with 64KB page granularity).

Furthermore, in the case of persistent grants we allocate one Linux page
per grant although only the first 4KB of the page will be effectively
in use. This could be improved by sharing the page with multiple grants.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit c004a6fe0c405e2aa91b2a88aa1428724e6d06f6)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/grant-table: Make it running on 64KB granularity

The Xen interface is using 4KB page granularity. This means that each
grant is 4KB.

The current implementation allocates a Linux page per grant. On Linux
using 64KB page granularity, only the first 4KB of the page will be
used.

We could decrease the memory wasted by sharing the page with multiple
grant. It will require some care with the {Set,Clear}ForeignPage macro.

Note that no changes has been made in the x86 code because both Linux
and Xen will only use 4KB page granularity.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 5ed5451d997f7a86c62a5557efc00dc3836dc559)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/events: fifo: Make it running on 64KB granularity

Only use the first 4KB of the page to store the events channel info. It
means that we will waste 60KB every time we allocate page for:
     * control block: a page is allocating per CPU
     * event array: a page is allocating everytime we need to expand it

I think we can reduce the memory waste for the 2 areas by:

    * control block: sharing between multiple vCPUs. Although it will
    require some bookkeeping in order to not free the page when the CPU
    goes offline and the other CPUs sharing the page still there

    * event array: always extend the array event by 64K (i.e 16 4K
    chunk). That would require more care when we fail to expand the
    event channel.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit a001c9d95c4ea96589461d58e77c96416a303e2c)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/balloon: Don't rely on the page granularity is the same for Xen and Linux

For ARM64 guests, Linux is able to support either 64K or 4K page
granularity. Although, the hypercall interface is always based on 4K
page granularity.

With 64K page granularity, a single page will be spread over multiple
Xen frame.

To avoid splitting the page into 4K frame, take advantage of the
extent_order field to directly allocate/free chunk of the Linux page
size.

Note that PVMMU is only used for PV guest (which is x86) and the page
granularity is always 4KB. Some BUILD_BUG_ON has been added to ensure
that because the code has not been modified.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 30756c62997822894fb34e2114f5dc727a12af30)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

tty/hvc: xen: Use xen page definition

The console ring is always based on the page granularity of Xen.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 9652c08012580c9961c77fc8726a877e0d437324)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/xenbus: Use Xen page definition

All the ring (xenstore, and PV rings) are always based on the page
granularity of Xen.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 7d567928db59cb249e5539ebb63890e24431669a)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/biomerge: Don't allow biovec's to be merged when Linux is not using 4KB pages

On ARM all dma-capable devices on a same platform may not be protected
by an IOMMU. The DMA requests have to use the BFN (i.e MFN on ARM) in
order to use correctly the device.

While the DOM0 memory is allocated in a 1:1 fashion (PFN == MFN), grant
mapping will screw this contiguous mapping.

When Linux is using 64KB page granularitary, the page may be split
accross multiple non-contiguous MFN (Xen is using 4KB page
granularity). Therefore a DMA request will likely fail.

Checking that a 64KB page is using contiguous MFN is tedious. For
now, always says that biovec are not mergeable.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 36f8abd36febf1c6f67dae26ec7a87be44629138)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

block/xen-blkfront: split get_grant in 2

Prepare the code to support 64KB page granularity. The first
implementation will use a full Linux page per indirect and persistent
grant. When non-persistent grant is used, each page of a bio request
may be split in multiple grant.

Furthermore, the field page of the grant structure is only used to copy
data from persistent grant or indirect grant. Avoid to set it for other
use case as it will have no meaning given the page will be split in
multiple grant.

Provide 2 functions, to setup indirect grant, the other for bio page.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 4f503fbdf319e4411aa48852b8922c93a9cc0c5d)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

block/xen-blkfront: Store a page rather a pfn in the grant structure

All the usage of the field pfn are done using the same idiom:

pfn_to_page(grant->pfn)

This will return always the same page. Store directly the page in the
grant to clean up the code.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit a7a6df222351de23791bb64165f14c21ff4d1653)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

block/xen-blkfront: Split blkif_queue_request in 2

Currently, blkif_queue_request has 2 distinct execution path:
- Send a discard request
- Send a read/write request

The function is also allocating grants to use for generating the
request. Although, this is only used for read/write request.

Rather than having a function with 2 distinct execution path, separate
the function in 2. This will also remove one level of tabulation.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 33204663ef85d9ce79009b2246afe6a2fef8eb1b)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/grant: Add helper gnttab_page_grant_foreign_access_ref_one

Many PV drivers contain the idiom:

pfn = page_to_gfn(...) /* Or similar */
gnttab_grant_foreign_access_ref

Replace it by a new helper. Note that when Linux is using a different
page granularity than Xen, the helper only gives access to the first 4KB
grant.

This is useful where drivers are allocating a full Linux page for each
grant.

Also include xen/interface/grant_table.h rather than xen/grant_table.h in
asm/page.h for x86 to fix a compilation issue [1]. Only the former is
useful in order to get the structure definition.

[1] Interdependency between asm/page.h and xen/grant_table.h which result
to page_mfn not being defined when necessary.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 3922f32c1e6db2e096ff095a5b8af0b940b97508)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/grant: Introduce helpers to split a page into grant

Currently, a grant is always based on the Xen page granularity (i.e
4KB). When Linux is using a different page granularity, a single page
will be split between multiple grants.

The new helpers will be in charge of splitting the Linux page into grants
and call a function given by the caller on each grant.

Also provide an helper to count the number of grants within a given
contiguous region.

Note that the x86/include/asm/xen/page.h is now including
xen/interface/grant_table.h rather than xen/grant_table.h. It's
necessary because xen/grant_table.h depends on asm/xen/page.h and will
break the compilation. Furthermore, only definition in
interface/grant_table.h is required.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 008c320a96d218712043f8db0111d5472697785c)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen: Add Xen specific page definition

The Xen hypercall interface is always using 4K page granularity on ARM
and x86 architecture.

With the incoming support of 64K page granularity for ARM64 guest, it
won't be possible to re-use the Linux page definition in Xen drivers.

Introduce Xen page definition helpers based on the Linux page
definition. They have exactly the same name but prefixed with
XEN_/xen_ prefix.

Also modify xen_page_to_gfn to use new Xen page definition.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 1084b1988d22dc165c9dbbc2b0e057f9248ac4db)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

arm/xen: Drop pte_mfn and mfn_pte

They are not used in common code expect in one place in balloon.c which is
only compiled when Linux is using PV MMU. It's not the case on ARM.

Rather than worrying how to handle the 64KB case, drop them.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 5031612b5eaf15cb6471c6e936a515090810c8f1)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

net/xen-netback: xenvif_gop_frag_copy: move GSO check out of the loop

The skb doesn't change within the function. Therefore it's only
necessary to check if we need GSO once at the beginning.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit a0f2e80fcd3b6b1369527828b02caab01af60c36)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/balloon: pre-allocate p2m entries for ballooned pages

Pages returned by alloc_xenballooned_pages() will be used for grant
mapping which will call set_phys_to_machine() (in PV guests).

Ballooned pages are set as INVALID_P2M_ENTRY in the p2m and thus may
be using the (shared) missing tables and a subsequent
set_phys_to_machine() will need to allocate new tables.

Since the grant mapping may be done from a context that cannot sleep,
the p2m entries must already be allocated.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
(cherry picked from commit 4a69c909deb0dd3cae653d14ac0ff52d5440a19c)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

x86/xen: export xen_alloc_p2m_entry()

Rename alloc_p2m() to xen_alloc_p2m_entry() and export it.

This is useful for ensuring that a p2m entry is allocated (i.e., not a
shared missing or identity entry) so that subsequent set_phys_to_machine()
calls will require no further allocations.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
---
v3:
- Make xen_alloc_p2m_entry() a nop on auto-xlate guests.

(cherry picked from commit 8edfcf882eb91ec9028c7334f90f6ef3db5b0fcf)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/balloon: use hotplugged pages for foreign mappings etc.

alloc_xenballooned_pages() is used to get ballooned pages to back
foreign mappings etc.  Instead of having to balloon out real pages,
use (if supported) hotplugged memory.

This makes more memory available to the guest and reduces
fragmentation in the p2m.

This is only enabled if the xen.balloon.hotplug_unpopulated sysctl is
set to 1.  This sysctl defaults to 0 in case the udev rules to
automatically online hotplugged memory do not exist.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
---
v3:
- Add xen.balloon.hotplug_unpopulated sysctl to enable use of hotplug
  for unpopulated pages.

(cherry picked from commit 1cf6a6c82918c9aad4bb73a7e7379a649e4d8e50)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/balloon: make alloc_xenballoon_pages() always allocate low pages

All users of alloc_xenballoon_pages() wanted low memory pages, so
remove the option for high memory.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
(cherry picked from commit 81b286e0f1fe520f2a96f736ffa7e508ac9139ba)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/balloon: only hotplug additional memory if required

Now that we track the total number of pages (included hotplugged
regions), it is easy to determine if more memory needs to be
hotplugged.

Add a new BP_WAIT state to signal that the balloon process needs to
wait until kicked by the memory add notifier (when the new section is
onlined by userspace).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
---
v3:
- Return BP_WAIT if enough sections are already hotplugged.

v2:
- New BP_WAIT status after adding new memory sections.

(cherry picked from commit b2ac6aa8f71bf57b948ee68cd913c350b932da88)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/balloon: rationalize memory hotplug stats

The stats used for memory hotplug make no sense and are fiddled with
in odd ways. Remove them and introduce total_pages to track the total
number of pages (both populated and unpopulated) including those within
hotplugged regions (note that this includes not yet onlined pages).

This will be used in a subsequent commit (xen/balloon: only hotplug
additional memory if required) when deciding whether additional memory
needs to be hotplugged.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
(cherry picked from commit de5a77d8422fc7ed0b2f4349bceb65a1a639e5b2)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/balloon: find non-conflicting regions to place hotplugged memory

Instead of placing hotplugged memory at the end of RAM (which may
conflict with PCI devices or reserved regions) use allocate_resource()
to get a new, suitably aligned resource that does not conflict.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
---
v3:
- Remove stale comment.

(cherry picked from commit 55b3da98a40dbb3776f7454daf0d95dde25c33d2)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

x86/xen: discard RAM regions above the maximum reservation

During setup, discard RAM regions that are above the maximum
reservation (instead of marking them as E820_UNUSABLE). This allows
hotplug memory to be placed at these addresses.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
(cherry picked from commit f5775e0b6116b7e2425ccf535243b21768566d87)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/balloon: remove scratch page left overs

Commit 0bb599fd30108883b00c7d4a226eeb49111e6932 (xen: remove scratch
frames for ballooned pages and m2p override) removed the use of the
scratch page for ballooned out pages.

Remove some left over function definitions.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
(cherry picked from commit f6a6cb1afe74d6ccc81aa70aa4ac3953762e7e6e)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

mm: memory hotplug with an existing resource

Add add_memory_resource() to add memory using an existing "System RAM"
resource. This is useful if the memory region is being located by
finding a free resource slot with allocate_resource().

Xen guests will make use of this in their balloon driver to hotplug
arbitrary amounts of memory in response to toolstack requests.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Conflicts:
include/linux/memory_hotplug.h
mm/memory_hotplug.c

(cherry picked from commit 62cedb9f135794ec26a93ae29e5f0231ab263c84)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-netfront: always set num queues if possible

If netfront connects with two (or more) queues and then reconnects with
only one queue it fails to delete or rewrite the multi-queue-num-queues
key and netback will try to use the wrong number of queues.

Always write the num-queues field if the backend has multi-queue support.

Signed-off-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 812494d9a0cacf77e0a538be18183c7b471812aa)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Merge branch 'uek4/4.3-xen-backport' of git://ca-git.us.oracle.com/linux-konrad-public into linux-4.1/4.4-xen-backport

* uek4/4.3-xen-backport of git://ca-git.us.oracle.com/linux-konrad-public: (100 commits)
  xen-netback: correctly check failed allocation
  xen/blkback: free requests on disconnection
  x86/xen/p2m: hint at the last populated P2M entry
  x86/xen: Do not clip xen_e820_map to xen_e820_map_entries when sanitizing map
  x86/xen: Support kexec/kdump in HVM guests by doing a soft reset
  xen/x86: Don't try to write syscall-related MSRs for PV guests
  xen: use correct type for HYPERVISOR_memory_op()
  xen/xenbus: Rename the variable xen_store_mfn to xen_store_gfn
  xen/privcmd: Further s/MFN/GFN/ clean-up
  hvc/xen: Further s/MFN/GFN clean-up
  video/xen-fbfront: Further s/MFN/GFN clean-up
  xen/tmem: Use xen_page_to_gfn rather than pfn_to_gfn
  xen: Use correctly the Xen memory terminologies
  arm/xen: implement correctly pfn_to_mfn
  xen: Make clear that swiotlb and biomerge are dealing with DMA address
  xen: switch extra memory accounting to use pfns
  xen: limit memory to architectural maximum
  xen: avoid another early crash of memory limited dom0
  xen: avoid early crash of memory limited dom0
  arm/xen: Remove helpers which are PV specific
  ...

Backports from Linux 4.3

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

x86/iopl/64: properly context-switch IOPL on Xen PV

On Xen PV, regs->flags doesn't reliably reflect IOPL and the
exit-to-userspace code doesn't change IOPL.  We need to context
switch it manually.

I'm doing this without going through paravirt because this is
specific to Xen PV.  After the dust settles, we can merge this with
the 32-bit code, tidy up the iopl syscall implementation, and remove
the set_iopl pvop entirely.

This is XSA-171.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Orabug: 22926124
Conflicts:
  arch/x86/kernel/process_64.c
  X86_FEATURE_XENPV not defined, too much extra to bring it in.
  Replaced with xen_pv_domain() which is trigger to set X86_FEATURE_XENPV.
CVE: CVE-2016-3157
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>

Merge branch 'topic/uek-4.1/xen-konrad-4.3-xen-backport-refresh' into topic/uek-4.1/xen

* topic/uek-4.1/xen-konrad-4.3-xen-backport-refresh: (48 commits)
  xen-netback: correctly check failed allocation
  xen/blkback: free requests on disconnection
  x86/xen/p2m: hint at the last populated P2M entry
  x86/xen: Do not clip xen_e820_map to xen_e820_map_entries when sanitizing map
  x86/xen: Support kexec/kdump in HVM guests by doing a soft reset
  xen/x86: Don't try to write syscall-related MSRs for PV guests
  xen: use correct type for HYPERVISOR_memory_op()
  xen/xenbus: Rename the variable xen_store_mfn to xen_store_gfn
  xen/privcmd: Further s/MFN/GFN/ clean-up
  hvc/xen: Further s/MFN/GFN clean-up
  video/xen-fbfront: Further s/MFN/GFN clean-up
  xen/tmem: Use xen_page_to_gfn rather than pfn_to_gfn
  xen: Use correctly the Xen memory terminologies
  arm/xen: implement correctly pfn_to_mfn
  xen: Make clear that swiotlb and biomerge are dealing with DMA address
  xen: switch extra memory accounting to use pfns
  xen: limit memory to architectural maximum
  xen: avoid another early crash of memory limited dom0
  xen: avoid early crash of memory limited dom0
  arm/xen: Remove helpers which are PV specific
  ...

xen-netback: correctly check failed allocation

Since vzalloc can be failed in memory pressure,
writes -ENOMEM to xenstore to indicate error.

Signed-off-by: Insu Yun <wuninsu@gmail.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 833b8f18adfcca04070a8a42d545a4553379d36f)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit afdd9c09c1689f43598a520ed60813dc6359d6c6)

xen/blkback: free requests on disconnection

This is due to commit 86839c56dee28c315a4c19b7bfee450ccd84cd25
"xen/block: add multi-page ring support"

When using an guest under UEFI - after the domain is destroyed
the following warning comes from blkback.

------------[ cut here ]------------
WARNING: CPU: 2 PID: 95 at
/home/julien/works/linux/drivers/block/xen-blkback/xenbus.c:274
xen_blkif_deferred_free+0x1f4/0x1f8()
Modules linked in:
CPU: 2 PID: 95 Comm: kworker/2:1 Tainted: G W 4.2.0 #85
Hardware name: APM X-Gene Mustang board (DT)
Workqueue: events xen_blkif_deferred_free
Call trace:
[<ffff8000000890a8>] dump_backtrace+0x0/0x124
[<ffff8000000891dc>] show_stack+0x10/0x1c
[<ffff8000007653bc>] dump_stack+0x78/0x98
[<ffff800000097e88>] warn_slowpath_common+0x9c/0xd4
[<ffff800000097f80>] warn_slowpath_null+0x14/0x20
[<ffff800000557a0c>] xen_blkif_deferred_free+0x1f0/0x1f8
[<ffff8000000ad020>] process_one_work+0x160/0x3b4
[<ffff8000000ad3b4>] worker_thread+0x140/0x494
[<ffff8000000b2e34>] kthread+0xd8/0xf0
---[ end trace 6f859b7883c88cdd ]---

Request allocation has been moved to connect_ring, which is called every
time blkback connects to the frontend (this can happen multiple times during
a blkback instance life cycle). On the other hand, request freeing has not
been moved, so it's only called when destroying the backend instance. Due to
this mismatch, blkback can allocate the request pool multiple times, without
freeing it.

In order to fix it, move the freeing of requests to xen_blkif_disconnect to
restore the symmetry between request allocation and freeing.

Reported-by: Julien Grall <julien.grall@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Julien Grall <julien.grall@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: xen-devel@lists.xenproject.org
CC: stable@vger.kernel.org # 4.2
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit f929d42ceb18a8acfd47e0e7b7d90b5d49bd9258)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 92fbb0ff8a675a08d5c00db0b50b36de36934b7f)

x86/xen/p2m: hint at the last populated P2M entry

With commit 633d6f17cd91ad5bf2370265946f716e42d388c6 (x86/xen: prepare
p2m list for memory hotplug) the P2M may be sized to accomdate a much
larger amount of memory than the domain currently has.

When saving a domain, the toolstack must scan all the P2M looking for
populated pages. This results in a performance regression due to the
unnecessary scanning.

Instead of reporting (via shared_info) the maximum possible size of
the P2M, hint at the last PFN which might be populated. This hint is
increased as new leaves are added to the P2M (in the expectation that
they will be used for populated entries).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: <stable@vger.kernel.org> # 4.0+
(cherry picked from commit 98dd166ea3a3c3b57919e20d9b0d1237fcd0349d)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit f80232d154a44edf89dc40a0754dd74628efce34)

x86/xen: Do not clip xen_e820_map to xen_e820_map_entries when sanitizing map

Sanitizing the e820 map may produce extra E820 entries which would result in
the topmost E820 entries being removed. The removed entries would typically
include the top E820 usable RAM region and thus result in the domain having
signicantly less RAM available to it.

Fix by allowing sanitize_e820_map to use the full size of the allocated E820
array.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 64c98e7f49100b637cd20a6c63508caed6bbba7a)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 1ea4ebd1eb569fc92ea891881262292fd4e4368e)

x86/xen: Support kexec/kdump in HVM guests by doing a soft reset

Currently there is a number of issues preventing PVHVM Xen guests from
doing successful kexec/kdump:

  - Bound event channels.
  - Registered vcpu_info.
  - PIRQ/emuirq mappings.
  - shared_info frame after XENMAPSPACE_shared_info operation.
  - Active grant mappings.

Basically, newly booted kernel stumbles upon already set up Xen
interfaces and there is no way to reestablish them. In Xen-4.7 a new
feature called 'soft reset' is coming. A guest performing kexec/kdump
operation is supposed to call SCHEDOP_shutdown hypercall with
SHUTDOWN_soft_reset reason before jumping to new kernel. Hypervisor
(with some help from toolstack) will do full domain cleanup (but
keeping its memory and vCPU contexts intact) returning the guest to
the state it had when it was first booted and thus allowing it to
start over.

Doing SHUTDOWN_soft_reset on Xen hypervisors which don't support it is
probably OK as by default all unknown shutdown reasons cause domain
destroy with a message in toolstack log: 'Unknown shutdown reason code
5. Destroying domain.'  which gives a clue to what the problem is and
eliminates false expectations.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 0b34a166f291d255755be46e43ed5497cdd194f2)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 1f63233f62ab07bd0617f92b2118fe7475674979)

xen/x86: Don't try to write syscall-related MSRs for PV guests

For PV guests these registers are set up by hypervisor and thus
should not be written by the guest. The comment in xen_write_msr_safe()
says so but we still write the MSRs, causing the hypervisor to
print a warning.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 2ecf91b6d8b0ee8ef38aa7ea2a0fe0cd57b6ca50)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit c8411fc452a0ff2a535525bfea9e266b234c34c8)