xen/arm: implement alloc/free_xenballooned_pages with alloc_pages/kfree

Only until we get the balloon driver to work.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit ea54209b16cbecad8928f6067af29069ac44e360)

xen/arm: receive Xen events on ARM

Compile events.c on ARM.
Parse, map and enable the IRQ to get event notifications from the device
tree (node "/xen").

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 0ec53ecf38bcbf95b4b057328a8fbba4d22ef28b)

xen/arm: initialize grant_table on ARM

Initialize the grant table mapping at the address specified at index 0
in the DT under the /xen node.
After the grant table is initialized, call xenbus_probe (if not dom0).

Changes in v2:

- introduce GRANT_TABLE_PHYSADDR;
- remove unneeded initialization of boot_max_nr_grant_frames.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
(cherry picked from commit b3b52fd87e8f7544fde75a471108bd5bd4492c90)

xen/arm: get privilege status

Use Xen features to figure out if we are privileged.

XENFEAT_dom0 was introduced by 23735 in xen-unstable.hg.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit ef61ee0dc7ba0409dc0e8122de90d4e48d4c8669)

xen/arm: introduce CONFIG_XEN on ARM

Changes in v5:

- make XEN_DOM0 depend on XEN;
- avoid "select XEN_DOM0" in XEN.

Changes in v2:

- mark Xen guest support on ARM as EXPERIMENTAL.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
CC: Sergei Shtylyov <sshtylyov@mvista.com>
(cherry picked from commit eff8d6447d5fac2995ffa5c1f0ea2da5bd7074c9)

xen: do not compile manage, balloon, pci, acpi, pcpu and cpu_hotplug on ARM

Changes in v4:
- compile pcpu only on x86;
- use "+=" instead of ":=" for dom0- targets.

Changes in v2:

- make pci.o depend on CONFIG_PCI and acpi.o depend on CONFIG_ACPI.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 13febc84849d4f14a088fe3b09111bdb1951ab42)

Conflicts:
drivers/xen/Makefile

xen/tmem: cleanup

Use 'bool' for boolean variables. Do proper section placement.
Eliminate an unnecessary export.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 8e6f7c23c135b13f3adf90906fac7edd325bb9af)

Conflicts:
drivers/xen/tmem.c

xen: Add selfballooning memory reservation tunable.

Currently, the memory target in the Xen selfballooning driver is mainly
driven by the value of "Committed_AS". However, there are cases in
which it is desirable to assign additional memory to be available for
the kernel, e.g. for local caches (which are not covered by cleancache),
e.g. dcache and inode caches.

This adds an additional tunable in the selfballooning driver (accessible
via sysfs) which allows the user to specify an additional constant
amount of memory to be reserved by the selfballooning driver for the
local domain.

Signed-off-by: Jana Saout <jana@saout.de>
Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit d79d5959a023fd637e90ed1ff6547ff09d19396b)

Conflicts:
drivers/xen/xen-selfballoon.c
[UEK2 hasn't done the 'convert sysdev_class to a regular subsystem'
patchset]

xen: constify all instances of "struct attribute_group"

The functions these get passed to have been taking pointers to const
since at least 2.6.16.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit ead1d01425bbd28c4354b539caa4075bde00ed72)

xen: Fix selfballooning and ensure it doesn't go too far

The balloon driver's "current_pages" is very different from
totalram_pages.  Self-ballooning needs to be driven by
the latter.  Also, Committed_AS doesn't account for pages
used by the kernel so:
1) Add totalreserve_pages to Committed_AS for the normal target.
2) Enforce a floor for when there are little or no user-space threads
   using memory (e.g. single-user mode) to avoid OOMs.  The floor
   function includes a "min_usable_mb" tuneable in case we discover
   later that the floor function is still too aggressive in some
   workloads, though likely it will not be needed.
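
As a rough user-space illustration of the two rules above, the sketch below
computes a ballooning goal in pages. All names are hypothetical, a 4 KiB page
size is assumed, and the floor is reduced to a simple constant term instead of
the piecewise linear function this patch describes.

```c
#define DEMO_PAGES_PER_MB 256UL   /* assumption: 4 KiB pages */

/* Hypothetical model of the target computation; all values are in pages. */
unsigned long demo_selfballoon_goal(unsigned long committed_as_pages,
                                    unsigned long totalreserve_pages,
                                    unsigned long min_usable_mb)
{
    /* Rule 1: Committed_AS ignores kernel pages, so add the reserve. */
    unsigned long goal = committed_as_pages + totalreserve_pages;

    /* Rule 2: enforce a floor (simplified here to reserve + min_usable_mb). */
    unsigned long floor_pages =
        totalreserve_pages + min_usable_mb * DEMO_PAGES_PER_MB;

    return goal < floor_pages ? floor_pages : goal;
}
```

With almost no user-space memory committed, the floor wins; under normal load
the Committed_AS based goal is used.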

Changes since version 4:
- change floor calculation so that it is not as aggressive; this version
  uses a piecewise linear function similar to minimum_target in the 2.6.18
  balloon driver, but modified to add to totalreserve_pages instead of
  subtract from max_pfn; the 2.6.18 version causes OOMs on recent kernels
  because the kernel has expanded over time
- change safety_margin to min_usable_mb and comment on its use
- since committed_as does NOT include kernel space (and other reserved
  pages), totalreserve_pages is now added to committed_as.  The result is
  less aggressive self-ballooning, but theoretically more appropriate.
Changes since version 3:
- missing include causes compile problem when CONFIG_FRONTSWAP is disabled
- add comments after includes
Changes since version 2:
- missing include causes compile problem only on 32-bit
Changes since version 1:
- tuneable safety margin added

[v5: avi.miller@oracle.com: still too aggressive, seeing some OOMs]
[v4: konrad.wilk@oracle.com: fix compile when CONFIG_FRONTSWAP is disabled]
[v3: guru.anbalagane@oracle.com: fix 32-bit compile]
[v2: konrad.wilk@oracle.com: make safety margin tuneable]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Altered description and added an extra include]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 38a1ed4f039db32b418007ac365076cf53647ebd)

xen: self-balloon needs module.h

Fix build errors (found when CONFIG_SYSFS is not enabled):

  drivers/xen/xen-selfballoon.c:446: warning: data definition has no type or storage class
  drivers/xen/xen-selfballoon.c:446: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
  drivers/xen/xen-selfballoon.c:446: warning: parameter names (without types) in function declaration
  drivers/xen/xen-selfballoon.c:485: error: expected declaration specifiers or '...' before string constant
  drivers/xen/xen-selfballoon.c:485: warning: data definition has no type or storage class
  drivers/xen/xen-selfballoon.c:485: warning: type defaults to 'int' in declaration of 'MODULE_LICENSE'
  drivers/xen/xen-selfballoon.c:485: warning: function declaration isn't a prototype

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4fec0e0bde09095b6349dc6206dbf19cebcd0a7e)

xen/balloon: Fix compile errors - missing header files.

With a specific enough .config file compile errors show
for missing workqueue declarations.

Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 0642d2edc858a1f08716bb32e1ab890db8dac246)

xen: tmem: self-ballooning and frontswap-selfshrinking

This patch introduces two in-kernel drivers for Xen transcendent memory
("tmem") functionality that complement cleancache and frontswap. Both
use control theory to dynamically adjust and optimize memory utilization.
Selfballooning controls the in-kernel Xen balloon driver, targeting a goal
value (vm_committed_as), thus pushing less frequently used clean
page cache pages (through the cleancache code) into Xen tmem where
Xen can balance needs across all VMs residing on the physical machine.
Frontswap-selfshrinking controls the number of pages in frontswap,
driving it towards zero (effectively doing a partial swapoff) when
in-kernel memory pressure subsides, freeing up RAM for other VMs.
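
A minimal sketch of the "drive it towards zero" idea, assuming a simple
proportional step per interval. The function name and the divisor parameter
are hypothetical and are not taken from the driver.

```c
/* Hypothetical: compute the next frontswap page target, one step nearer 0. */
unsigned long demo_frontswap_shrink_step(unsigned long cur_pages,
                                         unsigned long divisor)
{
    unsigned long step;

    if (cur_pages == 0)
        return 0;              /* nothing left to shrink */
    if (divisor == 0)
        divisor = 1;           /* guard against a divide-by-zero */

    step = cur_pages / divisor;
    if (step == 0)
        step = 1;              /* always make progress */

    return cur_pages - step;
}
```

Calling this repeatedly while memory pressure stays low shrinks the target
geometrically and then page by page until it reaches zero.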

More detail is provided in the header comment of xen-selfballoon.c.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v8: konrad.wilk@oracle.com: set default enablement depending on frontswap]
[v7: konrad.wilk@oracle.com: fix capitalization and punctuation in comments]
[v6: fix frontswap-selfshrinking initialization]
[v6: konrad.wilk@oracle.com: fix init pr_infos; add comments about swap]
[v5: konrad.wilk@oracle.com: add NULL to attr list; move inits up to decls]
[v4: dkiper@net-space.pl: use strict_strtoul plus a few syntactic nits]
[v3: konrad.wilk@oracle.com: fix potential divides-by-zero]
[v3: konrad.wilk@oracle.com: add many more comments, fix nits]
[v2: rebased to linux-3.0-rc1]
[v2: Ian.Campbell@citrix.com: reorganize as new file (xen-selfballoon.c)]
[v2: dkiper@net-space.pl: proper access to vm_committed_as]
[v2: dkiper@net-space.pl: accounting fixes]
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: <xen-devel@lists.xensource.com>
(cherry picked from commit a50777c791031d7345ce95785ea6220f67339d90)

Conflicts:
drivers/xen/xen-balloon.c
include/xen/balloon.h

xen: grant: use xen_pfn_t type for frame_list.

This correctly sizes it as 64 bit on ARM but leaves it as unsigned
long on x86 (therefore no intended change on x86).

The long and ulong guest handles are now unused (and a bit dangerous)
so remove them.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: sysfs: fix build warning.

Define PRI macros for xen_ulong_t and xen_pfn_t and use them to fix:
drivers/xen/sys-hypervisor.c:288:4: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'xen_ulong_t' [-Wformat]

Ideally this would use PRIx64 on ARM but these (or equivalent) don't
seem to be available in the kernel.
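
The pattern can be sketched in user space, where PRIx64 does exist. The type
and macro names below are hypothetical stand-ins, not the ones added by this
patch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins: a fixed-width type plus a matching format macro. */
typedef uint64_t demo_xen_ulong_t;
#define DEMO_PRI_xen_ulong PRIx64

/* Format the value with the macro so -Wformat stays quiet on any width. */
int demo_format_xen_ulong(char *buf, size_t len, demo_xen_ulong_t v)
{
    return snprintf(buf, len, "%" DEMO_PRI_xen_ulong, v);
}
```

Each architecture would pair its own typedef with its own format macro, so
callers never hard-code "%lx".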

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/arm: Introduce xen_ulong_t for unsigned long

All the original Xen headers have xen_ulong_t as unsigned long type, however
when they have been imported in Linux, xen_ulong_t has been replaced with
unsigned long. That might work for x86 and ia64 but it does not for arm.
Bring back xen_ulong_t and let each architecture define xen_ulong_t as they
see fit.

Also explicitly size pointers (__DEFINE_GUEST_HANDLE) to 64 bit.

Changes in v3:

- remove the incorrect changes to multicall_entry;
- remove the change to apic_physbase.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: Introduce xen_pfn_t for pfn and mfn types

All the original Xen headers have xen_pfn_t as mfn and pfn type, however
when they have been imported in Linux, xen_pfn_t has been replaced with
unsigned long. That might work for x86 and ia64 but it does not for arm.
Bring back xen_pfn_t and let each architecture define xen_pfn_t as they
see fit.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/arm: Xen detection and shared_info page mapping

Check for a node in the device tree compatible with "xen,xen", if it is
present set xen_domain_type to XEN_HVM_DOMAIN and continue
initialization.

Map the real shared info page using XENMEM_add_to_physmap with
XENMAPSPACE_shared_info.

Changes in v4:

- simpler parsing of Xen version in the compatible DT node.

Changes in v3:

- use the "xen,xen" notation rather than "arm,xen";
- add an additional check on the presence of the Xen version.

Changes in v2:

- replace pr_info with pr_debug.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
(cherry picked from commit 2e01f16601d8924b12b1acf1cdc49a0d1cc1cfb2)

docs: Xen ARM DT bindings

Add a doc to describe the Xen ARM device tree bindings

Changes in v5:

- add a comment about the size of the grant table memory region;
- add a comment about the required presence of a GIC node;
- specify that the described properties are part of a top-level
"hypervisor" node;
- specify #address-cells and #size-cells for the example.

Changes in v4:

- "xen,xen" should be last as it is less specific;
- update reg property using 2 address-cells and 2 size-cells.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Rob Herring <rob.herring@calxeda.com>
CC: devicetree-discuss@lists.ozlabs.org
CC: David Vrabel <david.vrabel@citrix.com>
CC: Rob Herring <robherring2@gmail.com>
CC: Dave Martin <dave.martin@linaro.org>
(cherry picked from commit c43cdfbc4cebdf1a7992432615bf5155b51b8cc0)

xen/arm: empty implementation of grant_table arch specific functions

Changes in v2:

- return -ENOSYS rather than -1.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 226f52e931ae6235d221629e6a8497fb111cfcd4)

xen/arm: sync_bitops

sync_bitops functions are equivalent to the SMP implementation of the
original functions, independently from CONFIG_SMP being defined.

We need them because _set_bit etc are not SMP safe if !CONFIG_SMP. But
under Xen you might be communicating with a completely external entity
who might be on another CPU (e.g. two uniprocessor guests communicating
via event channels and grant tables). So we need a variant of the bit
ops which are SMP safe even on a UP kernel.
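
A user-space analogue, assuming C11 atomics in place of the ARM assembly: the
operations below are always atomic, with no build-time switch to turn the
atomicity off. The demo_ names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically set bit nr in *word, however the program is configured. */
void demo_sync_set_bit(unsigned int nr, atomic_ulong *word)
{
    atomic_fetch_or(word, 1UL << nr);
}

/* Atomically set bit nr and report whether it was already set. */
bool demo_sync_test_and_set_bit(unsigned int nr, atomic_ulong *word)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_or(word, mask) & mask) != 0;
}
```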

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit e54d2f61528165bbe0e5f628b111eab3be31c3b5)

xen/arm: page.h definitions

ARM Xen guests always use paging in hardware, like PV on HVM guests in
the X86 world.

Changes in v3:

- improve comments.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 36a67abce227af005b4a595735556942b74f6741)

xen/arm: hypercalls

Use r12 to pass the hypercall number to the hypervisor.

We need a register to pass the hypercall number because we might not
know it at compile time and HVC only takes an immediate argument.

Among the available registers r12 seems to be the best choice because it
is defined as "intra-procedure call scratch register".

Use the ISS to pass a hypervisor-specific tag.

Changes in v2:
- define a HYPERCALL macro for 5-argument hypercall wrappers, even if it
is unused at the moment;
- use ldm instead of pop;
- fix up comments.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit aa2466d21bd9e872690693d56feb946781443f28)

arm: initial Xen support

- Basic hypervisor.h and interface.h definitions.
- Skeleton enlighten.c, set xen_start_info to an empty struct.
- Make xen_initial_domain dependent on the SIF_PRIVILIGED_BIT.

The new code only compiles when CONFIG_XEN is set, that is going to be
added to arch/arm/Kconfig in patch #11 "xen/arm: introduce CONFIG_XEN on
ARM".

Changes in v3:

- improve comments.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4c071ee5268f7234c3d084b6093bebccc28cdcba)

xen/vga: add the xen EFI video mode support

In order to add xen EFI framebuffer video support, it is required to add
xen-efi's new video type (XEN_VGATYPE_EFI_LFB) case and handle it in the
function xen_init_vga and set the video type to VIDEO_TYPE_EFI to enable
efi video mode.

The original patch from which this was broken out:
http://marc.info/?i=4E099AA6020000780004A4C6@nat28.tlf.novell.com

Signed-off-by: Jan Beulich <JBeulich@novell.com>
Signed-off-by: Tang Liang <liang.tang@oracle.com>
[v2: The original author is Jan Beulich and Liang Tang ported it to upstream]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit aa387d630cfed1a694a9c8c61fba3877ba8d4f07)

xen: allow enable use of VGA console on dom0

Get the information about the VGA console hardware from Xen, and put
it into the form the bootloader normally generates, so that the rest
of the kernel can deal with VGA as usual.

[ Impact: make VGA console work in dom0 ]

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Rebased on 2.6.39]
[v2: Removed incorrect comments and fixed compile warnings]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit c2419b4a4727f67af2fc2cd68b0d878b75e781bb)

Conflicts:
arch/x86/xen/Makefile
arch/x86/xen/enlighten.c
arch/x86/xen/xen-ops.h

xen/pcifront: Use Xen-SWIOTLB when initting if required.

We piggyback on "xen/swiotlb: Use the swiotlb_late_init_with_tbl to init
Xen-SWIOTLB late when PV PCI is used." functionality to start up
the Xen-SWIOTLB if we are hot-plugged. This allows us to bypass
the need to supply 'iommu=soft' on the Linux command line (mostly).
With this patch, if a user forgets 'iommu=soft' on the command line
and hotplugs a PCI device, they will get:

pcifront pci-0: Installing PCI frontend
Warning: only able to allocate 4 MB for software IO TLB
software IO TLB [mem 0x2a000000-0x2a3fffff] (4MB) mapped at [ffff88002a000000-ffff88002a3fffff]
pcifront pci-0: Creating PCI Frontend Bus 0000:00
pcifront pci-0: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io 0x0000-0xffff]
pci_bus 0000:00: root bus resource [mem 0x00000000-0xfffffffff]
pci 0000:00:00.0: [8086:10d3] type 00 class 0x020000
pci 0000:00:00.0: reg 10: [mem 0xfe5c0000-0xfe5dffff]
pci 0000:00:00.0: reg 14: [mem 0xfe500000-0xfe57ffff]
pci 0000:00:00.0: reg 18: [io 0xe000-0xe01f]
pci 0000:00:00.0: reg 1c: [mem 0xfe5e0000-0xfe5e3fff]
pcifront pci-0: claiming resource 0000:00:00.0/0
pcifront pci-0: claiming resource 0000:00:00.0/1
pcifront pci-0: claiming resource 0000:00:00.0/2
pcifront pci-0: claiming resource 0000:00:00.0/3
e1000e: Intel(R) PRO/1000 Network Driver - 2.0.0-k
e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
e1000e 0000:00:00.0: Disabling ASPM L0s L1
e1000e 0000:00:00.0: enabling device (0000 -> 0002)
e1000e 0000:00:00.0: Xen PCI mapped GSI16 to IRQ34
e1000e 0000:00:00.0: (unregistered net_device): Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
e1000e 0000:00:00.0: eth0: (PCI Express:2.5GT/s:Width x1) 00:1b:21:ab:c6:13
e1000e 0000:00:00.0: eth0: Intel(R) PRO/1000 Network Connection
e1000e 0000:00:00.0: eth0: MAC: 3, PHY: 8, PBA No: E46981-005

The "Warning only" will go away if one supplies 'iommu=soft' instead
as we have a higher chance of being able to allocate large swaths of
memory.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 3d925320e9e2de162bd138bf97816bda8c3f71be)

xen/swiotlb: For early initialization, return zero on success.

If everything is set up properly we would return -ENOMEM since
rc by default is set to that value. Let's not do that and return
a proper return code.

Note: The reason the early code needs this special treatment
is that the SWIOTLB library call does not return anything (and
had it failed it would call panic()) - but our function does.
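
The pitfall can be sketched as follows. This is a hypothetical stand-alone
function, not the Xen-SWIOTLB code: rc starts out pessimistic and must be
cleared explicitly because the wrapped initializer returns nothing.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical early-init path wrapping a void-returning library call. */
int demo_early_init(bool have_memory)
{
    int rc = -ENOMEM;      /* pessimistic default */

    if (!have_memory)
        return rc;         /* genuine failure keeps the default */

    /* ... the void-returning library initializer would run here ... */

    rc = 0;                /* the fix: report success explicitly */
    return rc;
}
```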

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit c468bdee28a1cb61d9b7a8ce9859d17dee43b7d7)

xen/swiotlb: Use the swiotlb_late_init_with_tbl to init Xen-SWIOTLB late when PV PCI is used.

With this patch we provide the functionality to initialize the
Xen-SWIOTLB late in the bootup cycle - specifically for
Xen PCI-frontend. We still will work if the user had
supplied 'iommu=soft' on the Linux command line.

Note: We cannot depend on after_bootmem to automatically
determine whether this is early or not. This is because
when PCI IOMMUs are initialized it is after after_bootmem but
before a lot of "other" subsystems are initialized.

CC: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
[v1: Fix smatch warnings]
[v2: Added check for xen_swiotlb]
[v3: Rebased with new xen-swiotlb changes]
[v4: squashed xen/swiotlb: Depending on after_bootmem is not correct in]
Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b82776005369899c1c7ca2e4b2414bb64b538d2c)

xen/swiotlb: Move the error strings to its own function.

That way we can more easily reuse those errors when using the
late SWIOTLB init.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 5bab7864b1167f9a72d375f6854027db436a1cc1)

xen/swiotlb: Move the nr_tbl determination in its own function.

Moving the function out of the way to prepare for the late
SWIOTLB init.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 1cef36a529f44dbb3612ee0deeb0b5563de36163)

xen: Use correct masking in xen_swiotlb_alloc_coherent.

When running 32-bit pvops-dom0 and a driver tries to allocate a coherent
DMA-memory the xen swiotlb-implementation returned memory beyond 4GB.

The underlying reason is that if the supplied driver passes in a
DMA_BIT_MASK(64) (hwdev->coherent_dma_mask is set to 0xffffffffffffffff),
our u64 dma_mask will be set to 0xffffffffffffffff even if we set it to
DMA_BIT_MASK(32) previously, meaning we do not reset the upper bits.
The dma_alloc_coherent_mask function does the proper casting, so by
using it we get 0xffffffff.
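
A small model of the difference, assuming the helper returns its result as an
unsigned long, which is 32 bits wide on a 32-bit kernel. The demo_ functions
are hypothetical; only the shape of the mask macro mirrors DMA_BIT_MASK.

```c
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK(n) macro. */
#define DEMO_DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Modelled old behaviour: the device's mask replaces the 32-bit default. */
uint64_t demo_mask_before(uint64_t coherent_dma_mask)
{
    uint64_t dma_mask = DEMO_DMA_BIT_MASK(32);

    if (coherent_dma_mask)
        dma_mask = coherent_dma_mask;
    return dma_mask;
}

/* Modelled new behaviour: the value passes through a 32-bit wide type,
 * standing in for unsigned long on a 32-bit kernel, so the upper bits
 * are dropped. */
uint64_t demo_mask_after(uint64_t coherent_dma_mask)
{
    uint64_t wide = coherent_dma_mask ? coherent_dma_mask
                                      : DEMO_DMA_BIT_MASK(32);

    return (uint32_t)wide;
}
```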

This caused sound to stop working on a system with 4 GB and a 64-bit
compatible sound card which sets the DMA mask to 64 bits.

On bare-metal and the forward-ported xen-dom0 patches from OpenSuse a coherent
DMA-memory is always allocated inside the 32-bit address-range by calling
dma_alloc_coherent_mask.

This patch adds the same functionality to xen swiotlb and is a rebase of the
original patch from Ronny Hegewald which never got upstream because the
underlying reason was not understood until now.

The original email with the original patch is in:
http://old-list-archives.xen.org/archives/html/xen-devel/2010-02/msg00038.html
the original thread from where the discussion started is in:
http://old-list-archives.xen.org/archives/html/xen-devel/2010-01/msg00928.html

Signed-off-by: Ronny Hegewald <ronny.hegewald@online.de>
Signed-off-by: Stefano Panella <stefano.panella@citrix.com>
Acked-By: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: stable@vger.kernel.org
(cherry picked from commit b5031ed1be0aa419250557123633453753181643)

xen/swiotlb: Use page alignment for early buffer allocation.

This fixes an odd bug found on a Dell PowerEdge 1850/0RC130
(BIOS A05 01/09/2006) where all of the modules doing pci_set_dma_mask
would fail with:

ata_piix 0000:00:1f.1: enabling device (0005 -> 0007)
ata_piix 0000:00:1f.1: can't derive routing for PCI INT A
ata_piix 0000:00:1f.1: BMDMA: failed to set dma mask, falling back to PIO

The issue was that the Xen-SWIOTLB was allocated such that the end of
the buffer was straddling a page (and was also above 4GB). The fix,
spotted by Kalev Leonid, was to piggyback on git commit
e79f86b2ef9c0a8c47225217c1018b7d3d90101c "swiotlb: Use page alignment
for early buffer allocation", which says:

We could call free_bootmem_late() if swiotlb is not used, and
it will shrink to page alignment.

So alloc them with page alignment at first, to avoid lose two pages

And doing that fixes the outstanding issue.

CC: stable@kernel.org
Suggested-by: "Kalev, Leonid" <Leonid.Kalev@ca.com>
Reported-and-Tested-by: "Taylor, Neal E" <Neal.Taylor@ca.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 63a741757d15320a25ebf5778f8651cce2ed0611)

swiotlb: Expose swiotlb_nr_tlb function to modules

As a mechanism to detect whether SWIOTLB is enabled or not.
We also fix the spelling - it was swioltb instead of
swiotlb.

CC: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
[v1: Ripped out swiotlb_enabled]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit f21ffe9f6da6d3a69c518b7345c198d48d941c34)

Conflicts:
include/linux/swiotlb.h

xen-swiotlb: When doing coherent alloc/dealloc check before swizzling the MFNs.

The process to swizzle a Machine Frame Number (MFN) is not always
necessary. Especially if we know that we actually do not have to do it.
In this patch we check the MFN against the device's coherent
DMA mask and whether the requested page(s) are contiguous. If it all checks
out we will just return the bus addr without doing the memory
swizzle.
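
The check can be sketched like this. The function and parameter names are
hypothetical, and whether the pages are machine-contiguous is passed in as a
flag instead of being derived from MFNs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: true when the buffer can be returned without a swizzle. */
bool demo_can_skip_swizzle(uint64_t bus_addr, uint64_t size,
                           uint64_t coherent_mask, bool machine_contiguous)
{
    if (size == 0)
        return false;

    /* Reject ranges whose end address would wrap around. */
    if (size - 1 > UINT64_MAX - bus_addr)
        return false;

    /* The last byte of the buffer must be reachable under the mask. */
    if (bus_addr + size - 1 > coherent_mask)
        return false;

    return machine_contiguous;
}
```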

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 6810df88dcfc22de267caf23eb072ffb97b3c411)

xen-swiotlb: fix printk and panic args

Fix printk() and panic() args [swap them] to fix build warnings:

drivers/xen/swiotlb-xen.c:201: warning: format '%s' expects type 'char *', but argument 2 has type 'int'
drivers/xen/swiotlb-xen.c:201: warning: format '%d' expects type 'int', but argument 3 has type 'char *'
drivers/xen/swiotlb-xen.c:202: warning: format '%s' expects type 'char *', but argument 2 has type 'int'
drivers/xen/swiotlb-xen.c:202: warning: format '%d' expects type 'int', but argument 3 has type 'char *'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 61ca79831ce52c23b3a130f3c2351751e00e0ac9)

xen-swiotlb: Fix wrong panic.

Propagate the baremetal git commit "swiotlb: fix wrong panic"
(fba99fa38b023224680308a482e12a0eca87e4e1) to the Xen-SWIOTLB version,
wherein swiotlb's map_page wrongly calls panic() when it can't find
a buffer fit for the device's dma mask. It should return an error instead.

Devices with an odd dma mask (i.e. under 4G) like b44 network card hit
this bug (the system crashes):

http://marc.info/?l=linux-kernel&m=129648943830106&w=2

If xen-swiotlb returns an error, the b44 driver can use its own bouncing
mechanism.

CC: stable@kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit ab2a47bd242d6cdcf6b2b64797f271c6f0a6d338)

xen-swiotlb: Retry up to three times to allocate Xen-SWIOTLB

We can fail setting up Xen-SWIOTLB if:
- The host does not have enough contiguous DMA32 memory available
   (can happen on a machine that has fragmented memory from starting,
   stopping many guests).
- Not enough low memory (almost never happens).

We retry allocating and exchanging the swath of contiguous memory
up to three times. Each time we decrease the amount we need - the
minimum being 2MB.
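
A sketch of such a retry loop, under stated assumptions: the allocator is a
hypothetical callback, "up to three times" is read as one initial attempt
plus three retries, and the request is halved on each retry (the text only
says it is decreased).

```c
#include <stdbool.h>

#define DEMO_MIN_BYTES   (2UL * 1024 * 1024)   /* 2MB floor from the text */
#define DEMO_MAX_RETRIES 3

/* Hypothetical allocator callback: true if `bytes` could be set up. */
typedef bool (*demo_alloc_fn)(unsigned long bytes);

/* Returns the size finally allocated, or 0 if every attempt failed. */
unsigned long demo_alloc_with_retry(unsigned long bytes,
                                    demo_alloc_fn try_alloc)
{
    int retries = DEMO_MAX_RETRIES;

    for (;;) {
        if (try_alloc(bytes))
            return bytes;
        if (retries-- == 0)
            return 0;

        /* Ask for less next time, but never less than the floor. */
        bytes /= 2;
        if (bytes < DEMO_MIN_BYTES)
            bytes = DEMO_MIN_BYTES;
    }
}

/* Example callbacks: a host with 16MB of contiguous memory, and one with none. */
bool demo_fits_16mb(unsigned long bytes) { return bytes <= (16UL << 20); }
bool demo_always_fails(unsigned long bytes) { (void)bytes; return false; }
```

Starting from 64MB with demo_fits_16mb, the loop fails at 64MB and 32MB and
then succeeds at 16MB.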

If we completely fail, we will print the reason for failure on the Xen
console in addition to the earlyprintk=xen console.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit f4b2f07b2ed9b469ead87e06fc2fc3d12663a725)

swiotlb: add the late swiotlb initialization function with iotlb memory

This enables the caller to initialize swiotlb with its own iotlb
memory late in the bootup.

See git commit eb605a5754d050a25a9f00d718fb173f24c486ef
"swiotlb: add swiotlb_tbl_map_single library function" which will
explain the full details of what it can be used for.

CC: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
[v1: Fold in smatch warning]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 74838b75379a53678ffc5f59de86161d21e2c808)

Conflicts:
include/linux/swiotlb.h

xen/swiotlb: With more than 4GB on 64-bit, disable the native SWIOTLB.

If a PV guest is booted the native SWIOTLB should not be
turned on. It does not help us (we don't have any PCI devices)
and it eats 64MB of good memory. In the case of PV guests
with PCI devices we need the Xen-SWIOTLB one.

[v1: Rewrite comment per Stefano's suggestion]
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit fc2341df9e31be8a3940f4e302372d7ef46bab8c)

Conflicts:
arch/x86/xen/pci-swiotlb-xen.c

xen/swiotlb: Simplify the logic.

It's pretty easy:
1). We only check to see if we need Xen SWIOTLB for PV guests.
2). If swiotlb=force or iommu=soft is set, then Xen SWIOTLB will
be enabled.
3). If it is an initial domain, then Xen SWIOTLB will be enabled.
4). Native SWIOTLB must be disabled for PV guests.
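
The four rules above can be written as a small decision function. This is a
stand-alone sketch with hypothetical names; the real detection code reads
kernel state instead of taking parameters.

```c
#include <stdbool.h>

struct demo_swiotlb_choice {
    bool use_xen_swiotlb;      /* enable the Xen SWIOTLB */
    bool use_native_swiotlb;   /* keep the native SWIOTLB */
};

struct demo_swiotlb_choice demo_swiotlb_detect(bool pv_guest,
                                               bool forced_on_cmdline,
                                               bool initial_domain,
                                               bool native_wanted)
{
    struct demo_swiotlb_choice c = { false, native_wanted };

    /* 1) Only PV guests are considered for the Xen SWIOTLB. */
    if (!pv_guest)
        return c;

    /* 2) swiotlb=force or iommu=soft, or 3) the initial domain. */
    if (forced_on_cmdline || initial_domain)
        c.use_xen_swiotlb = true;

    /* 4) The native SWIOTLB must be disabled for PV guests. */
    c.use_native_swiotlb = false;
    return c;
}
```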

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 988f0e24bbcbbf550dff016faf8273a94f4eb1af)

xen/gntdev: Xen backend support for paged out grant targets V4.

Since Xen-4.2, hvm domains may have portions of their memory paged out. When a
foreign domain (such as dom0) attempts to map these frames, the map will
initially fail. The hypervisor returns a suitable errno, and kicks an
asynchronous page-in operation carried out by a helper. The foreign domain is
expected to retry the mapping operation until it eventually succeeds. The
foreign domain is not put to sleep because it could itself be the one running the
pager assist (typical scenario for dom0).

This patch adds support for this mechanism for backend drivers using grant
mapping and copying operations. Specifically, this covers the blkback and
gntdev drivers (which map foreign grants), and the netback driver (which copies
foreign grants).

* Add a retry method for grants that fail with GNTST_eagain (i.e. because the
target foreign frame is paged out).
* Insert hooks with appropriate wrappers in the aforementioned drivers.

The retry loop is only invoked if the grant operation status is GNTST_eagain.
It guarantees to leave a new status code different from GNTST_eagain. Any other
status code results in identical code execution as before.

The retry loop performs 256 attempts with increasing time intervals through a
32 second period. It uses msleep to yield while waiting for the next retry.
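
A user-space sketch of such a loop follows. The grant operation and the sleep
are hypothetical callbacks so the logic can be exercised without a hypervisor,
the status values are stand-ins for the GNTST_* codes, and replacing a
persistent eagain with a "bad page" status is an assumption; any other
non-eagain code would satisfy the guarantee stated above.

```c
#include <stdint.h>

#define DEMO_GNTST_okay       0
#define DEMO_GNTST_bad_page  (-9)    /* stand-in value */
#define DEMO_GNTST_eagain    (-12)   /* stand-in value */
#define DEMO_MAX_DELAY       256

typedef int16_t (*demo_grant_op_fn)(void *ctx);
typedef void (*demo_sleep_fn)(unsigned int ms, void *ctx);

/* Retry while the op reports eagain, sleeping 1, 2, 3, ... ms in between. */
int16_t demo_retry_eagain(demo_grant_op_fn op, demo_sleep_fn sleep_ms,
                          void *ctx)
{
    unsigned int delay = 1;
    int16_t status;

    do {
        status = op(ctx);
        if (status == DEMO_GNTST_eagain)
            sleep_ms(delay++, ctx);
    } while (status == DEMO_GNTST_eagain && delay < DEMO_MAX_DELAY);

    /* Never hand eagain back to the caller. */
    if (status == DEMO_GNTST_eagain)
        status = DEMO_GNTST_bad_page;
    return status;
}

/* Test doubles: an op that is "paged out" N times, and a sleep that only
 * accumulates the requested time instead of blocking. */
struct demo_ctx { int eagain_left; unsigned long slept_ms; };

int16_t demo_op(void *p)
{
    struct demo_ctx *c = p;

    if (c->eagain_left > 0) {
        c->eagain_left--;
        return DEMO_GNTST_eagain;
    }
    return DEMO_GNTST_okay;
}

void demo_fake_sleep(unsigned int ms, void *p)
{
    ((struct demo_ctx *)p)->slept_ms += ms;
}
```

In this sketch the delays sum to 1 + 2 + ... + 255 = 32640 ms when the frame
never becomes available, which matches the roughly 32 second budget.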

V2 after feedback from David Vrabel:
* Explicit MAX_DELAY instead of wrap-around delay into zero
* Abstract GNTST_eagain check into core grant table code for netback module.

V3 after feedback from Ian Campbell:
* Add placeholder in array of grant table error descriptions for unrelated
error code we jump over.
* Eliminate single map and retry macro in favor of a generic batch flavor.
* Some renaming.
* Bury most implementation in grant_table.c, cleaner interface.

V4 rebased on top of sync of Xen grant table interface headers.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
[v5: Fixed whitespace issues]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit c571898ffc24a1768e1b2dabeac0fc7dd4c14601)

xen/arm: compile and run xenbus

bind_evtchn_to_irqhandler can legitimately return 0 (irq 0): it is not
an error.

If Linux is running as an HVM domain and is running as Dom0, use
xenstored_local_init to initialize the xenstore page and event channel.

Changes in v4:
- do not xs_reset_watches on dom0.

Changes in v2:
- refactor xenbus_init.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v5: Fixed case switch indentations]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit ecc635f90adfe1b7cd5fd354f49edfbf24aa4e3e)

xen: clear IRQ_NOAUTOEN and IRQ_NOREQUEST

Reset the IRQ_NOAUTOEN and IRQ_NOREQUEST flags that are enabled by
default on ARM. If IRQ_NOAUTOEN is set, __setup_irq doesn't call
irq_startup, which is responsible for calling irq_unmask at startup time.
As a result event channels remain masked.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit a8636c0b2e57d4f31f71aa306b1ee701db3f3c85)

xen/events: fix unmask_evtchn for PV on HVM guests

When unmask_evtchn is called, if we already have an event pending, we
just set evtchn_pending_sel waiting for local_irq_enable to be called.
That is because PV guests set the irq_enable pvops to
xen_irq_enable_direct in xen_setup_vcpu_info_placement:
xen_irq_enable_direct is implemented in assembly in
arch/x86/xen/xen-asm.S and calls xen_force_evtchn_callback if
XEN_vcpu_info_pending is set.

However HVM guests (and ARM guests) do not change or do not have the
irq_enable pvop, so evtchn_unmask cannot work properly for them.

Considering that having the pending_irq bit set when unmask_evtchn is
called is not very common, and it is simpler to keep the
native_irq_enable implementation for HVM guests (and ARM guests), the
best thing to do is just use the EVTCHNOP_unmask hypercall (Xen
re-injects pending events in response).
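
The decision described above can be sketched as a small predicate. This is
only an illustration of the reasoning in this message: the function name and
both flags are hypothetical, and the real code may take further conditions
into account.

```c
#include <stdbool.h>

/*
 * Returns true when unmasking should go through the EVTCHNOP_unmask
 * hypercall: an event is already pending and the guest has no irq_enable
 * pvop that would force the event callback later on.
 */
static bool unmask_needs_hypercall(bool event_pending, bool has_irq_enable_pvop)
{
	return event_pending && !has_irq_enable_pvop;
}
```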

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b5e579232d635b79a3da052964cb357ccda8d9ea)

xen/privcmd: Correctly return success from IOCTL_PRIVCMD_MMAPBATCH

This is a regression introduced by ceb90fa0 (xen/privcmd: add
PRIVCMD_MMAPBATCH_V2 ioctl). It broke xentrace as it used
xc_map_foreign() instead of xc_map_foreign_bulk().

Most code paths prefer MMAPBATCH_V2, so the breakage was not very
obvious. The return value is set early on to -EINVAL, and if all
goes well, the "set top bits of the MFN's" code never gets called, so the
return value is still -EINVAL when the function gets to the end, causing
the caller to think it went wrong (which it didn't!)

Now also including Andres' suggestion to move the ret = -EINVAL into the
error handling path, as this avoids other similar errors in future.
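
The shape of the bug and of the fix can be sketched as follows. This is a
simplified illustration, not the privcmd code: both functions, their
parameters and the use of -ENOENT for the error-reporting path are
hypothetical.

```c
#include <errno.h>

/* Buggy shape: ret starts as -EINVAL and the success path never resets it. */
static int map_batch_buggy(int input_ok, int nr_failed)
{
	int ret = -EINVAL;

	if (!input_ok)
		goto out;
	if (nr_failed > 0)
		ret = -ENOENT;  /* the only place ret is rewritten */
out:
	return ret;             /* still -EINVAL when everything went well */
}

/* Fixed shape: success is the default, -EINVAL is set where it is detected. */
static int map_batch_fixed(int input_ok, int nr_failed)
{
	int ret = 0;

	if (!input_ok) {
		ret = -EINVAL;
		goto out;
	}
	if (nr_failed > 0)
		ret = -ENOENT;
out:
	return ret;
}
```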

Signed-off-by: Mats Petersson <mats.petersson@citrix.com>
Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Acked-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 68fa965dd923177eafad49b7a0045fc610917341)

xen/mmu: Use Xen specific TLB flush instead of the generic one.

As Mukesh explained it, the MMUEXT_TLB_FLUSH_ALL allows the
hypervisor to do a TLB flush on all active vCPUs. If instead
we were using the generic one (which ends up being xen_flush_tlb)
we end up making the MMUEXT_TLB_FLUSH_LOCAL hypercall. But
before we make that hypercall the kernel will IPI all of the
vCPUs (even those that were asleep from the hypervisor
perspective). The end result is that we needlessly wake them
up and do a TLB flush when we can just let the hypervisor
do it correctly.

This patch gives around 50% speed improvement when migrating
idle guests from one host to another.

Oracle-bug: 14630170

CC: stable@vger.kernel.org
Tested-by: Jingjie Jiang <jingjie.jiang@oracle.com>
Suggested-by: Mukesh Rathor <mukesh.rathor@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 95a7d76897c1e7243d4137037c66d15cbf2cce76)

xen/enlighten: Disable MWAIT_LEAF so that acpi-pad won't be loaded.

There are exactly four users of __monitor and __mwait:

- cstate.c (which allows acpi_processor_ffh_cstate_enter to be called
   when the cpuidle API drivers are used). However patch
   "cpuidle: replace xen access to x86 pm_idle and default_idle"
   provides a mechanism to disable the cpuidle and use safe_halt.
- smpboot (which allows mwait_play_dead to be called). However
   safe_halt is always used so we skip that.
- intel_idle (same deal as above).
- acpi_pad.c. This is the one that we do not want to run as we
   will hit the below crash.

Why do we want to expose MWAIT_LEAF in the first place?
We want it for the xen-acpi-processor driver - which uploads
C-states to the hypervisor. If MWAIT_LEAF is set, the cstate.c
sets the proper address in the C-states so that the hypervisor
can benefit from using the MWAIT functionality. And that is
the sole reason for using it.

Without this patch, if a module performs mwait or monitor we
get this:

invalid opcode: 0000 [#1] SMP
CPU 2
.. snip..
Pid: 5036, comm: insmod Tainted: G           O 3.4.0-rc2upstream-dirty #2 Intel Corporation S2600CP/S2600CP
RIP: e030:[<ffffffffa000a017>]  [<ffffffffa000a017>] mwait_check_init+0x17/0x1000 [mwait_check]
RSP: e02b:ffff8801c298bf18  EFLAGS: 00010282
RAX: ffff8801c298a010 RBX: ffffffffa03b2000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff8801c29800d8 RDI: ffff8801ff097200
RBP: ffff8801c298bf18 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: ffffffffa000a000 R14: 0000005148db7294 R15: 0000000000000003
FS:  00007fbb364f2700(0000) GS:ffff8801ff08c000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000179f038 CR3: 00000001c9469000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process insmod (pid: 5036, threadinfo ffff8801c298a000, task ffff8801c29cd7e0)
Stack:
ffff8801c298bf48 ffffffff81002124 ffffffffa03b2000 00000000000081fd
000000000178f010 000000000178f030 ffff8801c298bf78 ffffffff810c41e6
00007fff3fb30db9 00007fff3fb30db9 00000000000081fd 0000000000010000
Call Trace:
[<ffffffff81002124>] do_one_initcall+0x124/0x170
[<ffffffff810c41e6>] sys_init_module+0xc6/0x220
[<ffffffff815b15b9>] system_call_fastpath+0x16/0x1b
Code: <0f> 01 c8 31 c0 0f 01 c9 c9 c3 00 00 00 00 00 00 00 00 00 00 00 00
RIP  [<ffffffffa000a017>] mwait_check_init+0x17/0x1000 [mwait_check]
RSP <ffff8801c298bf18>
---[ end trace 16582fc8a3d1e29a ]---
Kernel panic - not syncing: Fatal exception

With this module (which is what acpi_pad.c would hit):

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("mwait_check_and_back");
MODULE_LICENSE("GPL");
MODULE_VERSION();

static int __init mwait_check_init(void)
{
__monitor((void *)&current_thread_info()->flags, 0, 0);
__mwait(0, 0);
return 0;
}
static void __exit mwait_check_exit(void)
{
}
module_init(mwait_check_init);
module_exit(mwait_check_exit);

Reported-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit df88b2d96e36d9a9e325bfcd12eb45671cbbc937)

x86, amd, xen: Avoid NULL pointer paravirt references

Stub out MSR methods that aren't actually needed. This fixes a crash
as Xen Dom0 on AMD Trinity systems. A bigger patch should be added to
remove the paravirt machinery completely for the methods which
apparently have no users!

Reported-by: Andre Przywara <andre.przywara@amd.com>
Link: http://lkml.kernel.org/r/20120530222356.GA28417@andromeda.dapyr.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
(cherry picked from commit 1ab46fd319bcf1fcd9fb6311727d532b580e4eba)

xen/setup: filter APERFMPERF cpuid feature out

Xen PV kernels allow access to the APERF/MPERF registers to read the
effective frequency. Access to the MSRs is however redirected to the
currently scheduled physical CPU, making consecutive read and
compares unreliable. In addition each rdmsr traps into the hypervisor.
So to avoid bogus readouts and expensive traps, disable the kernel
internal feature flag for APERF/MPERF if running under Xen.
This will
a) remove the aperfmperf flag from /proc/cpuinfo
b) not mislead the power scheduler (arch/x86/kernel/cpu/sched.c) to
use the feature to improve scheduling (by default disabled)
c) not mislead the cpufreq driver to use the MSRs

This does not cover userland programs which access the MSRs via the
device file interface, but this will be addressed separately.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Cc: stable@vger.kernel.org # v3.0+
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 5e626254206a709c6e937f3dda69bf26c7344f6f)

xen/enlighten: Expose MWAIT and MWAIT_LEAF if hypervisor OKs it.

For the hypervisor to take advantage of the MWAIT support it needs
to extract from the ACPI _CST the register address. But the
hypervisor does not have the support to parse DSDT so it relies on
the initial domain (dom0) to parse the ACPI Power Management information
and push it up to the hypervisor. The pushing of the data is done
by the processor_harveset_xen module which parses the information that
the ACPI parser has graciously exposed in 'struct acpi_processor'.

For the ACPI parser to also expose the Cx states for MWAIT, we need
to expose the MWAIT capability (leaf 1). Furthermore we also need to
expose the MWAIT_LEAF capability (leaf 5) for cstate.c to properly
function.

The hypervisor could expose these flags when it traps the XEN_EMULATE_PREFIX
operations, but it can't do it since it needs to be backwards compatible.
Instead we choose to use the native CPUID to figure out if the MWAIT
capability exists and use the XEN_SET_PDC query hypercall to figure out
if the hypervisor wants us to expose the MWAIT_LEAF capability or not.

Note: The XEN_SET_PDC query was implemented in c/s 23783:
"ACPI: add _PDC input override mechanism".

With this in place, instead of
C3 ACPI IOPORT 415
we get now
C3:ACPI FFH INTEL MWAIT 0x20

Note: The cpu_idle which would be calling the mwait variants for idling
never gets set b/c we set the default pm_idle to be the hypercall variant.

Acked-by: Jan Beulich <JBeulich@suse.com>
[v2: Fix missing header file include and #ifdef]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 73c154c60be106b47f15d1111fc2d75cc7a436f2)

Conflicts:
arch/x86/xen/enlighten.c

xen/acpi: Fix potential memory leak.

Coverity points out that in one case we do not free
pr_backup, and sure enough we forgot.

Found by Coverity (CID 401970)

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 17f9b896b06d314da890174584278dea8da7e0ce)

xen-acpi-processor: Add missing #include <xen/xen.h>

This file depends on <xen/xen.h>, but the dependency was hidden due
to: <asm/acpi.h> -> <asm/trampoline.h> -> <asm/io.h> -> <xen/xen.h>

With the removal of <asm/trampoline.h>, this exposed the missing #include.

Reported-by: Ingo Molnar <mingo@kernel.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jarkko Sakkinen <jarkko.sakkinen@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
(cherry picked from commit 323f90a60864f30fd6b7c99806584bb90ada1a29)

xen/acpi: Workaround broken BIOSes exporting non-existing C-states.

We did a similar check for the P-states but did not do it for
the C-states. What we want to do is ignore cases where the DSDT
has definition for sixteen CPUs, but the machine only has eight
CPUs and we get:
xen-acpi-processor: (CX): Hypervisor error (-22) for ACPI CPU14

Reported-by: Tobias Geiger <tobias.geiger@vido.info>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b930fe5e1f5646e071facda70b25b137ebeae5af)

xen/acpi: Remove the WARN's as they just create noise.

When booting the kernel under machines that do not have P-states
we would end up with:

------------[ cut here ]------------
WARNING: at drivers/xen/xen-acpi-processor.c:504 xen_acpi_processor_init+0x286/0x2e0()
Hardware name: ProLiant BL460c G6
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.39-200.0.3.el5uek #1
Call Trace:
  [<ffffffff8191d056>] ? xen_acpi_processor_init+0x286/0x2e0
  [<ffffffff81068300>] warn_slowpath_common+0x90/0xc0
  [<ffffffff8191cdd0>] ? check_acpi_ids+0x1e0/0x1e0
  [<ffffffff8106834a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8191d056>] xen_acpi_processor_init+0x286/0x2e0
  [<ffffffff8191cdd0>] ? check_acpi_ids+0x1e0/0x1e0
  [<ffffffff81002168>] do_one_initcall+0xe8/0x130

.. snip..

Which is OK - the machines do not have P-states, so we fail to register
to process the _PXX states. But there is no need to WARN the user
of it.

Oracle BZ# 13871288
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 27257fc07c044af99d85400c4bab670342bbc8a5)

xen/acpi-processor: C and P-state driver that uploads said data to hypervisor.

This driver solves three problems:
1). Parse and upload ACPI0007 (or PROCESSOR_TYPE) information to the
hypervisor - aka P-states (cpufreq data).
2). Upload the Cx state information (cpuidle data).
3). Inhibit CPU frequency scaling drivers from loading.

The reason for wanting to solve 1) and 2) is such that the Xen hypervisor
is the only one that knows the CPU usage of different guests and can
make the proper decision of when to put CPUs and packages in proper states.
Unfortunately the hypervisor has no support to parse ACPI DSDT tables, hence it
needs help from the initial domain to provide this information. The reason
for 3) is that we do not want the initial domain to change P-states while the
hypervisor is doing it as well - it causes some rather funny cases of P-state
transitions.

For this to work, the driver parses the Power Management data and uploads said
information to the Xen hypervisor. It also calls acpi_processor_notify_smm()
to inhibit the other CPU frequency scaling drivers from being loaded.

Everything revolves around the 'struct acpi_processor' structure which
gets updated during the bootup cycle in different stages. At the startup, when
the ACPI parser starts, the C-state information is processed (processor_idle)
and saved in said structure as 'power' element. Later on, the CPU frequency
scaling driver (powernow-k8 or acpi_cpufreq), would call the
acpi_processor_* (processor_perflib functions) to parse P-states information
and populate in the said structure the 'performance' element.

Since we want to keep the CPU frequency scaling drivers from loading,
we have to call the acpi_processor_* functions to parse the P-states ourselves
and call "acpi_processor_notify_smm" to stop them from loading.

There is also one oddity in this driver which is that under Xen, the
physical online CPU count can be different from the virtual online CPU count.
Meaning that the macros 'for_[online|possible]_cpu' would process only
up to virtual online CPU count. We on the other hand want to process
the full amount of physical CPUs. For that, the driver checks if the ACPI IDs
count is different from the APIC ID count - which can happen if the user
chooses to use the dom0_max_vcpus argument. In such a case a backup of the PM
structure is used and uploaded to the hypervisor.

[v1-v2: Initial RFC implementations that were posted]
[v3: Changed the name to passthru suggested by Pasi Kärkkäinen <pasik@iki.fi>]
[v4: Added vCPU != pCPU support - aka dom0_max_vcpus support]
[v5: Cleaned up the driver, fix bug under Athlon XP]
[v6: Changed the driver to a CPU frequency governor]
[v7: Jan Beulich <jbeulich@suse.com> suggestion to make it a cpufreq scaling driver
made me rework it as driver that inhibits cpufreq scaling driver]
[v8: Per Jan's review comments, fixed up the driver]
[v9: Allow to continue even if acpi_processor_preregister_perf.. fails]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 59a56802918100c1e39e68c30a2e5ae9f7d837f0)

Conflicts:
drivers/xen/Makefile

xen/acpi: Domain0 acpi parser related platform hypercall

This patch implements the xen_platform_op hypercall, to pass the parsed
ACPI info to hypervisor.

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Added DEFINE_GUEST.. in appropriate headers]
[v2: Ripped out typedefs]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: resynchronise grant table status codes with upstream

Adds GNTST_address_too_big and GNTST_eagain.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit e58f5b55113b8fd4eb8eb43f5508d87e4862f280)

xen/privcmd: return -EFAULT on error

__copy_to_user() returns the number of bytes remaining to be copied but
we want to return a negative error code here.
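
A minimal sketch of the pattern, assuming a userspace stand-in for
__copy_to_user(): fake_copy_to_user and store_result are hypothetical names,
and the "writable" parameter only exists to simulate a partial copy.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for __copy_to_user(): returns the number of bytes NOT copied. */
static unsigned long fake_copy_to_user(void *dst, const void *src,
				       size_t n, size_t writable)
{
	size_t done = n < writable ? n : writable;

	memcpy(dst, src, done);
	return n - done;
}

/* Turn "bytes remaining" into a proper negative error code. */
static int store_result(void *dst, const void *src, size_t n, size_t writable)
{
	if (fake_copy_to_user(dst, src, n, writable))
		return -EFAULT;
	return 0;
}
```

Returning the raw result of the copy would hand a positive byte count to the
caller, which is easily mistaken for success.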

Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 9d2be9287107695708e6aae5105a8a518a6cb4d0)

xen/privcmd: Fix mmap batch ioctl error status copy back.

Copy back of per-slot error codes is only necessary for V2. V1 does not provide
an error array, so copyback will unconditionally set the global rc to EFAULT.
Only copyback for V2.

Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 1714df7f2cee6a741c3ed24231ec5db25b90633a)

xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl

PRIVCMD_MMAPBATCH_V2 extends PRIVCMD_MMAPBATCH with an additional
field for reporting the error code for every frame that could not be
mapped. libxc prefers PRIVCMD_MMAPBATCH_V2 over PRIVCMD_MMAPBATCH.

Also expand PRIVCMD_MMAPBATCH to return appropriate error-encoding top nibble
in the mfn array.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit ceb90fa0a8008059ecbbf9114cb89dc71a730bb6)

xen/mm: return more precise error from xen_remap_domain_range()

Callers of xen_remap_domain_range() need to know if the remap failed
because the frame is currently paged out, so that they can retry the remap
later on.  Return -ENOENT in this case.

This assumes that the error codes returned by Xen are a subset of
those used by the kernel.  It is unclear if this is defined as part of
the hypercall ABI.

Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 69870a847856a1ba81f655a8633fce5f5b614730)

xen/swiotlb: Fix compile warnings when using plain integer instead of NULL pointer.

arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer
arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 6d7083eee3bc088d1fc30eefabd6263bca40c95a)

xen/swiotlb: Remove functions not needed anymore.

Sparse warns us of:
drivers/xen/swiotlb-xen.c:506:1: warning: symbol 'xen_swiotlb_map_sg' was not declared. Should it be static?
drivers/xen/swiotlb-xen.c:534:1: warning: symbol 'xen_swiotlb_unmap_sg' was not declared. Should it be static?

and it looks like we do not need these functions at all.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit a8752fd9a4106c5efe324109df133692d5fcbffc)

xen: allow privcmd for HVM guests

This patch removes the "return -ENOSYS" for auto_translated_physmap
guests from privcmd_mmap, thus it allows ARM guests to issue privcmd
mmap calls. However privcmd mmap calls are still going to fail for HVM
and hybrid guests on x86 because the xen_remap_domain_mfn_range
implementation is currently PV only.

Changes in v2:

- better commit message;
- return -EINVAL from xen_remap_domain_mfn_range if
auto_translated_physmap.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 1a1d43318aeb74d679372c0b65029957be274529)

xen/sysfs: Use XENVER_guest_handle to query UUID

This hypercall has been present since Xen 3.1, and is the preferred
method for a domain to obtain its UUID. Fall back to the xenstore method
if using an older version of Xen (which returns -ENOSYS).

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 5c13f8067745efc15f6ad0158b58d57c44104c25)

xen/apic/xenbus/swiotlb/pcifront/grant/tmem: Make functions or variables static.

There is no need for those functions/variables to be visible. Make them
static and also fix the compile warnings of this sort:

drivers/xen/<some file>.c: warning: symbol '<blah>' was not declared. Should it be static?

Some of them just require including the header file that
declares the functions.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b8b0f559c7b1dcf5503817e518c81c9a18ee45e0)

Conflicts:

arch/x86/xen/apic.c

xen: missing includes

Changes in v2:
- remove pvclock hack;
- remove include linux/types.h from xen/interface/xen.h.
v3:
- Compile under IA64
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4d9310e39728a87c86eb48492da7546f61189633)

xen: update xen_add_to_physmap interface

Update struct xen_add_to_physmap to be in sync with Xen's version of the
structure.
The size field was introduced by:

changeset:   24164:707d27fe03e7
user:        Jean Guyader <jean.guyader@eu.citrix.com>
date:        Fri Nov 18 13:42:08 2011 +0000
summary:     mm: New XENMEM space, XENMAPSPACE_gmfn_range

According to the comment:

"This new field .size is located in the 16 bits padding between .domid
and .space in struct xen_add_to_physmap to stay compatible with older
versions."
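
The compatibility argument in that comment can be checked with a small layout
sketch. The two structs below are simplified, hypothetical stand-ins (plain
integer types instead of the Xen typedefs), and the offsets assume a common
ABI where unsigned int is 4 bytes and 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed old layout: the compiler pads 16 bits after domid. */
struct add_to_physmap_old {
	uint16_t domid;
	unsigned int space;
	unsigned long idx;
	unsigned long gpfn;
};

/* Assumed new layout: size occupies what used to be padding. */
struct add_to_physmap_new {
	uint16_t domid;
	uint16_t size;
	unsigned int space;
	unsigned long idx;
	unsigned long gpfn;
};

/* The new field changes neither the total size nor the offset of .space. */
_Static_assert(sizeof(struct add_to_physmap_old) ==
	       sizeof(struct add_to_physmap_new), "size unchanged");
_Static_assert(offsetof(struct add_to_physmap_old, space) ==
	       offsetof(struct add_to_physmap_new, space), "space did not move");
```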

Changes in v2:

- remove erroneous comment in the commit message.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b58aaa4b0b3506c094308342d746f600468c63d9)

xen/p2m: Fix off-by-one error in checking the P2M tree directory.

We would traverse the full P2M top directory (from 0->MAX_DOMAIN_PAGES
inclusive) when trying to figure out whether we can re-use some of the
P2M middle leafs.

Which meant that if the kernel was compiled with MAX_DOMAIN_PAGES=512
we would try to use the 512th entry. Fortunately for us the p2m_top_index
has a check for this:

BUG_ON(pfn >= MAX_P2M_PFN);

which we hit and saw this:

(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.1.2-OVM  x86_64  debug=n  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff819cadeb>]
(XEN) RFLAGS: 0000000000000212   EM: 1   CONTEXT: pv guest
(XEN) rax: ffffffff81db5000   rbx: ffffffff81db4000   rcx: 0000000000000000
(XEN) rdx: 0000000000480211   rsi: 0000000000000000   rdi: ffffffff81db4000
(XEN) rbp: ffffffff81793db8   rsp: ffffffff81793d38   r8:  0000000008000000
(XEN) r9:  4000000000000000   r10: 0000000000000000   r11: ffffffff81db7000
(XEN) r12: 0000000000000ff8   r13: ffffffff81df1ff8   r14: ffffffff81db6000
(XEN) r15: 0000000000000ff8   cr0: 000000008005003b   cr4: 00000000000026f0
(XEN) cr3: 0000000661795000   cr2: 0000000000000000
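
The off-by-one itself can be sketched like this. TOP_ENTRIES and both walk
functions are hypothetical; the sketch only shows how an inclusive upper
bound visits one slot too many.

```c
#define TOP_ENTRIES 512  /* hypothetical number of slots in the top directory */

/* Buggy walk: the inclusive bound also visits index TOP_ENTRIES. */
static int walk_inclusive(int *out_of_range)
{
	int visited = 0;

	*out_of_range = 0;
	for (int i = 0; i <= TOP_ENTRIES; i++) {
		if (i >= TOP_ENTRIES)
			*out_of_range = 1;  /* where a BUG_ON-style check fires */
		visited++;
	}
	return visited;
}

/* Fixed walk: the exclusive bound stays within 0..TOP_ENTRIES-1. */
static int walk_exclusive(int *out_of_range)
{
	int visited = 0;

	*out_of_range = 0;
	for (int i = 0; i < TOP_ENTRIES; i++) {
		if (i >= TOP_ENTRIES)
			*out_of_range = 1;
		visited++;
	}
	return visited;
}
```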

Fixes-Oracle-Bug: 14570662
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/p2m: When revectoring deal with holes in the P2M array.

When we free the PFNs and then subsequently populate them back
during bootup:

Freeing 20000-20200 pfn range: 512 pages freed
1-1 mapping on 20000->20200
Freeing 40000-40200 pfn range: 512 pages freed
1-1 mapping on 40000->40200
Freeing bad80-badf4 pfn range: 116 pages freed
1-1 mapping on bad80->badf4
Freeing badf6-bae7f pfn range: 137 pages freed
1-1 mapping on badf6->bae7f
Freeing bb000-100000 pfn range: 282624 pages freed
1-1 mapping on bb000->100000
Released 283999 pages of unused memory
Set 283999 page(s) to 1-1 mapping
Populating 1acb8a-1f20e9 pfn range: 283999 pages added

We end up having the P2M array (that is the one that was
grafted on the P2M tree) filled with IDENTITY_FRAME or
INVALID_P2M_ENTRY entries. The patch titled

"xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID."
recycles said slots and replaces the P2M tree leafs that point to
&mfn_list[xx] with p2m_identity or p2m_missing.

And re-uses the P2M array sections for other P2M tree leafs.
For the above mentioned bootup excerpt, the PFNs at
0x20000->0x20200 are going to be IDENTITY based:

P2M[0][256][0] -> P2M[0][257][0] get turned into IDENTITY_FRAME.
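
The indexes quoted in this message follow from splitting a PFN into three
levels. The sketch below assumes 512 entries per level (a 4096-byte page
holding 8-byte entries); struct p2m_index and p2m_index_of are hypothetical
names used only for illustration.

```c
#include <stdint.h>

#define ENTRIES_PER_PAGE 512u  /* assumed: 4096-byte page / 8-byte entry */

struct p2m_index {
	unsigned int top, mid, leaf;
};

/* Split a PFN into its P2M[top][mid][leaf] coordinates. */
static struct p2m_index p2m_index_of(uint64_t pfn)
{
	struct p2m_index ix;

	ix.top  = (unsigned int)(pfn / (ENTRIES_PER_PAGE * ENTRIES_PER_PAGE));
	ix.mid  = (unsigned int)((pfn / ENTRIES_PER_PAGE) % ENTRIES_PER_PAGE);
	ix.leaf = (unsigned int)(pfn % ENTRIES_PER_PAGE);
	return ix;
}
```

Under these assumptions PFN 0x20000 lands at P2M[0][256][0] and PFN 0x20200
at P2M[0][257][0], matching the range above.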

We can re-use that and replace P2M[0][256] to point to p2m_identity.
The "old" page (the grafted P2M array provided by Xen) that was at
P2M[0][256] gets put somewhere else. Specifically at P2M[6][358],
b/c when we populate back:

Populating 1acb8a-1f20e9 pfn range: 283999 pages added

we fill P2M[6][358][0] (and P2M[6][358], P2M[6][359], ...) with
the new MFNs.

That is all OK, except when we revector we assume that the PFN
count would be the same in the grafted P2M array and in the
newly allocated. Since that is no longer the case, as we have
holes in the P2M that point to p2m_missing or p2m_identity we
have to take that into account.

[v2: Check for overflow]
[v3: Move within the __va check]
[v4: Fix the computation]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 3fc509fc0c590900568ef516a37101d88f3476f5)

Conflicts:

arch/x86/xen/p2m.c

xen/mmu: Recycle the Xen provided L4, L3, and L2 pages

As we are not using them. We end up only using the L1 pagetables
and grafting those to our page-tables.

[v1: Per Stefano's suggestion squashed two commits]
[v2: Per Stefano's suggestion simplified loop]
[v3: Fix smatch warnings]
[v4: Add more comments]
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 488f046df922af992c1a718eff276529c0510885)

Conflicts:

arch/x86/xen/mmu.c

xen/mmu: If the revector fails, don't attempt to revector anything else.

If the P2M revectoring would fail, we would try to continue on by
cleaning the PMD for L1 (PTE) page-tables. The xen_cleanhighmap
is greedy and erases the PMD on both boundaries. Since the P2M
array can share the PMD, we would wipe out part of the __ka
that is still used in the P2M tree to point to P2M leafs.

This fixes it by bypassing the revectoring and continuing on.
If the revector fails, a nice WARN is printed so we can still
troubleshoot this.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/p2m: When revectoring deal with holes in the P2M array.

When we free the PFNs and then subsequently populate them back
during bootup:

Freeing 20000-20200 pfn range: 512 pages freed
1-1 mapping on 20000->20200
Freeing 40000-40200 pfn range: 512 pages freed
1-1 mapping on 40000->40200
Freeing bad80-badf4 pfn range: 116 pages freed
1-1 mapping on bad80->badf4
Freeing badf6-bae7f pfn range: 137 pages freed
1-1 mapping on badf6->bae7f
Freeing bb000-100000 pfn range: 282624 pages freed
1-1 mapping on bb000->100000
Released 283999 pages of unused memory
Set 283999 page(s) to 1-1 mapping
Populating 1acb8a-1f20e9 pfn range: 283999 pages added

We end up having the P2M array (that is the one that was
grafted on the P2M tree) filled with IDENTITY_FRAME or
INVALID_P2M_ENTRY entries. The patch titled

"xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID."
recycles said slots and replaces the P2M tree leafs that point to
&mfn_list[xx] with p2m_identity or p2m_missing.

And re-uses the P2M array sections for other P2M tree leafs.
For the above mentioned bootup excerpt, the PFNs at
0x20000->0x20200 are going to be IDENTITY based:

P2M[0][256][0] -> P2M[0][257][0] get turned into IDENTITY_FRAME.

We can re-use that and replace P2M[0][256] to point to p2m_identity.
The "old" page (the grafted P2M array provided by Xen) that was at
P2M[0][256] gets put somewhere else. Specifically at P2M[6][358],
b/c when we populate back:

Populating 1acb8a-1f20e9 pfn range: 283999 pages added

we fill P2M[6][358][0] (and P2M[6][358], P2M[6][359], ...) with
the new MFNs.

That is all OK, except when we revector we assume that the PFN
count would be the same in the grafted P2M array and in the
newly allocated. Since that is no longer the case, as we have
holes in the P2M that point to p2m_missing or p2m_identity we
have to take that into account.

[v2: Check for overflow]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID.

If a P2M leaf is completely packed with INVALID_P2M_ENTRY or with
1:1 PFNs (so IDENTITY_FRAME type PFNs), we can swap the P2M leaf
with either a p2m_missing or p2m_identity respectively. The old
page (which was created via extend_brk or was grafted on from the
mfn_list) can be re-used for setting new PFNs.

This also means we can remove git commit:
5bc6f9888db5739abfa0cae279b4b442e4db8049
xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back
which tried to fix this.

and make the amount that is required to be reserved much smaller.

CC: stable@vger.kernel.org # for 3.5 only.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Revert "xen PVonHVM: move shared_info to MMIO before kexec"

This reverts commit cfa1df57c9047dcd1b743a4c0487eb686bdea013.

It causes an infinite reading loop of pvclock on shutdown in the
PVHVM case. Will revisit once it's fixed upstream.

xen/mmu: Release just the MFN list, not MFN list and part of pagetables.

We call memblock_reserve for [start of mfn list] -> [PMD aligned end
of mfn list] instead of [start of mfn list] -> [page aligned end of mfn list].

This has the disastrous effect that if at bootup the end of mfn_list is
not PMD aligned we end up returning to memblock parts of the region
past the mfn_list array. And those parts are the PTE tables with
the disastrous effect of seeing this at bootup:

Write protecting the kernel read-only data: 10240k
Freeing unused kernel memory: 1860k freed
Freeing unused kernel memory: 200k freed
(XEN) mm.c:2429:d0 Bad type (saw 1400000000000002 != exp 7000000000000000) for mfn 116a80 (pfn 14e26)
...
(XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
(XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0
.. and so on.
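
How much extra memory a PMD-aligned end covers compared with a page-aligned
end can be sketched with plain arithmetic. The helper names are hypothetical,
and the sizes assume 4 KiB pages and 2 MiB PMD entries as on x86-64.

```c
#include <stdint.h>

#define PAGE_SZ (1ULL << 12)  /* 4 KiB page */
#define PMD_SZ  (1ULL << 21)  /* 2 MiB, the span of one PMD entry */

/* Round x up to a multiple of a (a must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Bytes past the page-aligned end that a PMD-aligned end would also cover. */
static uint64_t overshoot(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;

	return align_up(end, PMD_SZ) - align_up(end, PAGE_SZ);
}
```

Whenever the end of the list is not PMD aligned, the overshoot is non-zero,
and whatever lives in that tail (here, PTE tables) is handed over as well.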

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 0ebf4641eb144e633ae9a6466f4c9eaa1db6dc9b)

Conflicts:

arch/x86/xen/mmu.c

xen/mmu/enlighten: Fix memblock_x86_reserve_range downport.

This is not for upstream as memblock_x86_reserve_range is not
used upstream anymore.

When I back-ported the patches:
xen/x86: Use memblock_reserve for sensitive areas.
xen/mmu: Recycle the Xen provided L4, L3, and L2 pages

I simply used sed s/memblock_reserve/memblock_x86_reserve_range/.
That was incorrect as the parameters are different - memblock_reserve
expects the size as its second argument, while memblock_x86_reserve_range
expects the physical address. This patch fixes those bugs.
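
The mismatch can be illustrated with two stand-in functions that only differ
in what their second argument means. reserve_by_size and reserve_by_end are
hypothetical models of the two calling conventions, not the kernel functions.

```c
#include <stdint.h>

struct range {
	uint64_t start, end;  /* end is exclusive */
};

/* Models a (start, size) interface, like memblock_reserve. */
static struct range reserve_by_size(uint64_t start, uint64_t size)
{
	struct range r = { start, start + size };

	return r;
}

/* Models a (start, end address) interface, like memblock_x86_reserve_range. */
static struct range reserve_by_end(uint64_t start, uint64_t end)
{
	struct range r = { start, end };

	return r;
}
```

Passing a size where an end address is expected, as a blind sed would do,
yields a range whose end lies below its start for any realistic address.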

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back.

When we release pages back during bootup:

Freeing  9d-100 pfn range: 99 pages freed
Freeing  9cf36-9d0d2 pfn range: 412 pages freed
Freeing  9f6bd-9f6bf pfn range: 2 pages freed
Freeing  9f714-9f7bf pfn range: 171 pages freed
Freeing  9f7e0-9f7ff pfn range: 31 pages freed
Freeing  9f800-100000 pfn range: 395264 pages freed
Released 395979 pages of unused memory

We then try to populate those pages back. In the P2M tree however
the space for those leafs must be reserved - as such we use extend_brk.
We reserve 8MB of _brk space, which means we can fit over
1048576 PFNs - which is more than we should ever need.
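
As a rough check of these figures, the arithmetic can be sketched in
Python. The sketch assumes 4 KiB pages and one 8-byte entry per PFN in
a P2M leaf, and it ignores any space needed for middle-level pages or
alignment.

```python
PAGE_SIZE = 4096                   # assumption: 4 KiB pages
ENTRY_SIZE = 8                     # assumption: one 64-bit MFN per P2M entry
BRK_RESERVE = 8 * 1024 * 1024      # the 8MB reserved by this patch

leaf_pages = BRK_RESERVE // PAGE_SIZE           # leaf pages that fit
entries_per_leaf = PAGE_SIZE // ENTRY_SIZE      # PFNs covered per leaf
pfns_covered = leaf_pages * entries_per_leaf

# Page counts from the "Freeing ..." lines above.
freed = [99, 412, 2, 171, 31, 395264]
released = sum(freed)

# Leaf pages needed to describe the released pages (ceiling division).
leafs_needed = -(-released // entries_per_leaf)

print(pfns_covered, released, leafs_needed)
```

Under these assumptions the released pages need far fewer leaf pages
than the reservation provides.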

[v1: Made it 8MB of _brk space instead of 4MB per Jan's suggestion]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 99266871de5006ba7ad0bfece6bb283ede4094b9)

xen/mmu: Remove from __ka space PMD entries for pagetables.

Please first read the description in "xen/mmu: Copy and revector the
P2M tree."

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

xen_remove_p2m_tree and the code around it have ripped out the __ka
mappings for the old P2M array.

Here we continue doing the same for the region where the Xen
page-tables were.
It is safe to do it, as the page-tables are addressed using __va.
For good measure we delete anything that is within MODULES_VADDR
and up to the end of the PMD.

At this point the __ka only contains PMD entries for the start
of the kernel up to __brk.

[v1: Per Stefano's suggestion wrapped the MODULES_VADDR in debug]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4e928e1a48b6b76e0b8384160213a32d03197e4b)

xen/mmu: Copy and revector the P2M tree.

Please first read the description in "xen/p2m: Add logic to revector a
P2M tree to use __va leafs" patch.

The 'xen_revector_p2m_tree()' function allocates a new P2M tree,
copies the contents of the old one into it, and returns the new one.

At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).

We have revectored the P2M tree (and the one for save/restore as well)
to use shiny new __va addresses pointing to new MFNs. The xen_start_info
has been taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so
we are free to roam and delete PMD entries - which is exactly what
we are going to do. We rip out the __ka for the old P2M array.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:

arch/x86/xen/mmu.c
[upstream git commit 3a06359601deaec046ce33008527edfa6731ef23]
[s/memblock_free/memblock_x86_free_range]

xen/p2m: Add logic to revector a P2M tree to use __va leafs.

During bootup Xen supplies us with a P2M array. It sticks
it right after the ramdisk, as can be seen with a 128GB PV guest:

(certain parts removed for clarity):
xc_dom_build_image: called
xc_dom_alloc_segment:   kernel       : 0xffffffff81000000 -> 0xffffffff81e43000  (pfn 0x1000 + 0xe43 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1000+0xe43 at 0x7f097d8bf000
xc_dom_alloc_segment:   ramdisk      : 0xffffffff81e43000 -> 0xffffffff925c7000  (pfn 0x1e43 + 0x10784 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x1e43+0x10784 at 0x7f0952dd2000
xc_dom_alloc_segment:   phys2mach    : 0xffffffff925c7000 -> 0xffffffffa25c7000  (pfn 0x125c7 + 0x10000 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x125c7+0x10000 at 0x7f0942dd2000
xc_dom_alloc_page   :   start info   : 0xffffffffa25c7000 (pfn 0x225c7)
xc_dom_alloc_page   :   xenstore     : 0xffffffffa25c8000 (pfn 0x225c8)
xc_dom_alloc_page   :   console      : 0xffffffffa25c9000 (pfn 0x225c9)
nr_page_tables: 0x0000ffffffffffff/48: 0xffff000000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x0000007fffffffff/39: 0xffffff8000000000 -> 0xffffffffffffffff, 1 table(s)
nr_page_tables: 0x000000003fffffff/30: 0xffffffff80000000 -> 0xffffffffbfffffff, 1 table(s)
nr_page_tables: 0x00000000001fffff/21: 0xffffffff80000000 -> 0xffffffffa27fffff, 276 table(s)
xc_dom_alloc_segment:   page tables  : 0xffffffffa25ca000 -> 0xffffffffa26e1000  (pfn 0x225ca + 0x117 pages)
xc_dom_pfn_to_ptr: domU mapping: pfn 0x225ca+0x117 at 0x7f097d7a8000
xc_dom_alloc_page   :   boot stack   : 0xffffffffa26e1000 (pfn 0x226e1)
xc_dom_build_image  : virt_alloc_end : 0xffffffffa26e2000
xc_dom_build_image  : virt_pgtab_end : 0xffffffffa2800000

So the physical memory and virtual (using __START_KERNEL_map addresses)
layouts look like this:

  phys                             __ka
/------------\                   /-------------------\
| 0          | empty             | 0xffffffff80000000|
| ..         |                   | ..                |
| 16MB       | <= kernel starts  | 0xffffffff81000000|
| ..         |                   |                   |
| 30MB       | <= kernel ends => | 0xffffffff81e43000|
| ..         |  & ramdisk starts | ..                |
| 293MB      | <= ramdisk ends=> | 0xffffffff925c7000|
| ..         |  & P2M starts     | ..                |
| ..         |                   | ..                |
| 549MB      | <= P2M ends    => | 0xffffffffa25c7000|
| ..         | start_info        | 0xffffffffa25c7000|
| ..         | xenstore          | 0xffffffffa25c8000|
| ..         | console           | 0xffffffffa25c9000|
| 549MB      | <= page tables => | 0xffffffffa25ca000|
| ..         |                   |                   |
| 550MB      | <= PGT end     => | 0xffffffffa26e1000|
| ..         | boot stack        |                   |
\------------/                   \-------------------/

As can be seen, the ramdisk, P2M and pagetables take up a fair
amount of __ka address space. This is a problem since
MODULES_VADDR starts at 0xffffffffa0000000 - and the P2M sits
right in there! As a result, modules cannot be loaded during
bootup, and we get this error:

------------[ cut here ]------------
WARNING: at /home/konrad/ssd/linux/mm/vmalloc.c:106 vmap_page_range_noflush+0x2d9/0x370()
Call Trace:
[<ffffffff810719fa>] warn_slowpath_common+0x7a/0xb0
[<ffffffff81030279>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
[<ffffffff81071a45>] warn_slowpath_null+0x15/0x20
[<ffffffff81130b89>] vmap_page_range_noflush+0x2d9/0x370
[<ffffffff81130c4d>] map_vm_area+0x2d/0x50
[<ffffffff811326d0>] __vmalloc_node_range+0x160/0x250
[<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
[<ffffffff810c6186>] ? load_module+0x66/0x19c0
[<ffffffff8105cadc>] module_alloc+0x5c/0x60
[<ffffffff810c5369>] ? module_alloc_update_bounds+0x19/0x80
[<ffffffff810c5369>] module_alloc_update_bounds+0x19/0x80
[<ffffffff810c70c3>] load_module+0xfa3/0x19c0
[<ffffffff812491f6>] ? security_file_permission+0x86/0x90
[<ffffffff810c7b3a>] sys_init_module+0x5a/0x220
[<ffffffff815ce339>] system_call_fastpath+0x16/0x1b
---[ end trace fd8f7704fdea0291 ]---
vmalloc: allocation failure, allocated 16384 of 20480 bytes
modprobe: page allocation failure: order:0, mode:0xd2
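
The overlap behind this failure can be checked from the
xc_dom_alloc_segment addresses quoted earlier. The following Python
sketch assumes __START_KERNEL_map is 0xffffffff80000000, 4 KiB pages
and 8-byte P2M entries; the helper name ka_to_phys is made up for
illustration.

```python
START_KERNEL_MAP = 0xffffffff80000000   # assumption: __START_KERNEL_map
MODULES_VADDR = 0xffffffffa0000000      # value quoted in the text
MB = 1 << 20

def ka_to_phys(ka):
    """Translate a __ka address to a physical address."""
    return ka - START_KERNEL_MAP

# (start, end) __ka addresses taken from the xc_dom_alloc_segment lines.
segments = {
    "kernel":      (0xffffffff81000000, 0xffffffff81e43000),
    "ramdisk":     (0xffffffff81e43000, 0xffffffff925c7000),
    "phys2mach":   (0xffffffff925c7000, 0xffffffffa25c7000),
    "page tables": (0xffffffffa25ca000, 0xffffffffa26e1000),
}

for name, (start, end) in segments.items():
    overlaps = end > MODULES_VADDR
    print(f"{name:12s} phys {ka_to_phys(start) // MB:4d}MB - "
          f"{ka_to_phys(end) // MB:4d}MB  reaches module space: {overlaps}")

# Size of the guest described by the P2M array.
p2m_start, p2m_end = segments["phys2mach"]
p2m_entries = (p2m_end - p2m_start) // 8    # assuming 8-byte entries
guest_bytes = p2m_entries * 4096            # assuming 4 KiB pages
```

In this model the P2M array straddles MODULES_VADDR and the page
tables lie entirely above it, and the P2M size corresponds to a 128GB
guest.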

Since the __va and __ka are 1:1 up to MODULES_VADDR and
cleanup_highmap rids __ka of the ramdisk mapping, what
we want to do is similar - get rid of the P2M in the __ka
address space. There are two ways of fixing this:

1) All P2M lookups instead of using the __ka address would
    use the __va address. This means we can safely erase from
    __ka space the PMD pointers that point to the PFNs for
    P2M array and be OK.
2) Allocate a new array, copy the existing P2M into it,
    revector the P2M tree to use that, and return the old
    P2M to the memory allocator. This has the advantage that
    it sets the stage for using XEN_ELF_NOTE_INIT_P2M
    feature. That feature allows us to set the exact virtual
    address space we want for the P2M - and allows us to
    boot as initial domain on large machines.

So we pick option 2).
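
The idea behind option 2 can be sketched with a toy Python model. This
is not the kernel implementation: the tree is reduced to lists of leaf
records, and the function name, the record layout and the new base
address are hypothetical.

```python
# Toy model of option 2 (not the kernel implementation).
# A "leaf" is a dict holding the address it lives at and its entries.

def revector(tree, old_start, old_end, new_base, leaf_size=4096):
    """Copy every leaf living in [old_start, old_end) to new storage and
    repoint the tree at the copy. Returns the number of leaves moved."""
    moved = 0
    for mid in tree:
        for i, leaf in enumerate(mid):
            if old_start <= leaf["addr"] < old_end:
                mid[i] = {
                    "addr": new_base + moved * leaf_size,
                    "entries": list(leaf["entries"]),   # copy the contents
                }
                moved += 1
    return moved

# Two leaves inside the old __ka array, one that already lives elsewhere.
old = 0xffffffff925c7000
tree = [[
    {"addr": old,                "entries": [0x116a80, 0x116a81]},
    {"addr": old + 4096,         "entries": [0x116a82, 0x116a83]},
    {"addr": 0xffffffff81a25000, "entries": [0x4040]},
]]
before = [list(leaf["entries"]) for leaf in tree[0]]

# new_base is a hypothetical __va address chosen for the example.
moved = revector(tree, old, old + 0x10000000, new_base=0xffff880000100000)
```

After the call no leaf refers to the old range any more, so that range
could be handed back, while every lookup still returns the same
entries.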

This patch only lays the groundwork in the P2M code. The patch
that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 0b4d1198932f4204d74f6032dce6dbd095fa9531)

xen/mmu: Recycle the Xen provided L4, L3, and L2 pages

As we are not using them. We end up only using the L1 pagetables
and grafting those to our page-tables.

[v1: Per Stefano's suggestion squashed two commits]
[v2: Per Stefano's suggestion simplified loop]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:

arch/x86/xen/mmu.c
[s/memblock_reserve/memblock_x86_reserve_range]
[cherry picked from d950a0fb6d64c4c9f160e3770cef0109e27627b0]

xen/mmu: For 64-bit do not call xen_map_identity_early

B/c we do not need it. During startup, Xen provides us with all the
mapped memory that we need to function.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit d90be24669f2c39a29f821a654956f30cc9c4ed2)

xen/mmu: use copy_page instead of memcpy.

After all, this is what it is there for.

Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit cbc09be35990fb3d15671507f11c3e90479ef816)

xen/mmu: Provide comments describing the _ka and _va aliasing issue

Which is that the level2_kernel_pgt (__ka virtual addresses)
and level2_ident_pgt (__va virtual address) contain the same
PMD entries. So if you modify a PTE in __ka, it will be reflected
in __va (and vice-versa).
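
The aliasing can be illustrated with a toy Python model in which both
level-2 tables hold a reference to the same page of PTEs. The list
layout and the index used are assumptions made for the example, not
the kernel's data structures.

```python
# Toy model: two level-2 tables whose PMD entries reference the same
# page of PTEs (not the kernel's data structures).
pte_page = [0] * 512                  # one shared page of PTEs

level2_kernel_pgt = [pte_page]        # reached through __ka addresses
level2_ident_pgt = [pte_page]         # reached through __va addresses

# Modify a PTE through the __ka view ...
level2_kernel_pgt[0][5] = 0x8000000116a83067

# ... and the change is visible through the __va view as well.
seen_via_va = level2_ident_pgt[0][5]
```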

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 26e694dc644c36641d6d73585400caa1f015e1fd)

xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything.

We don't need to return the new PGD - as we do not use it.

Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit a573e36a3641f268ee6215a7d7cf74610ca5e81a)

Conflicts:

arch/x86/xen/enlighten.c
arch/x86/xen/mmu.c

xen/x86: Use memblock_reserve for sensitive areas.

instead of a big memblock_reserve. This way we can be more
selective in freeing regions (and it also makes it easier
to understand what is where).

[v1: Move the auto_translate_physmap to proper line]
[v2: Per Stefano's suggestion add more comments]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[upstream git commit 91addbf07abfdd109a9da4e02061e6ed3728b298]
Conflicts:

arch/x86/xen/setup.c
[s/memblock_reserve/memblock_x86_reserve_range]

xen/p2m: Fix the comment describing the P2M tree.

It mixed up the p2m_mid_missing with p2m_missing. Also
remove some extra spaces.

[upstream git commit 800ea898bbd7f79ef99695f71538f204e24cbcf3]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/perf: Define .glob for the different hypercalls.

This allows us in perf to have this:

99.67%  [kernel]             [k] xen_hypercall_sched_op
  0.11%  [kernel]             [k] xen_hypercall_xen_version

instead of the boring ever-encompassing:

99.13%  [kernel]              [k] hypercall_page

[v2: Use a macro to define the name and skip]
[v3: Use balign per Jan's suggestion]

[upstream git commit 7d0642b93780a7309d2954de6f6126d6ceb526f0]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/p2m: Check __brk_limit before allocating.

The P2M code is smart enough to return false (which means that it
cannot allocate any more) and the error can percolate up the calling
stack without trouble - with the error logic doing the proper thing.

So check the __brk_limit values before allocating from extend_brk.
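
The check can be sketched with a toy bump allocator in Python. It is
only a model: the class and function names are hypothetical, only the
out-of-space condition is modelled, and the base and limit are the
"earlier kernels" addresses quoted further down in this message.

```python
class Brk:
    """Toy bump allocator standing in for the _brk area (not kernel code)."""

    def __init__(self, base, limit):
        self.end = base
        self.limit = limit

    def extend(self, size):
        # Models the fatal check: running past the limit is a bug.
        if self.end + size > self.limit:
            raise RuntimeError("BUG: out of brk space")
        addr = self.end
        self.end += size
        return addr

def alloc_leaf_checked(brk, page_size=4096):
    """Return a leaf address, or None when the brk area is exhausted."""
    if brk.end + page_size > brk.limit:
        return None            # the caller propagates the failure
    return brk.extend(page_size)

brk = Brk(base=0xffffffff81a25000, limit=0xffffffff81a7b000)

leaves = 0
while alloc_leaf_checked(brk) is not None:
    leaves += 1

print(leaves, "leaf pages allocated before running out")
```

The checked path stops cleanly once the space is used up, whereas
calling extend() directly at that point would hit the fatal check.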

This allows us to boot on machines where we do not have enough
__brk space and where we would otherwise get this:

(XEN) domain_crash_sync called from entry.S
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff818aad3b>]
(XEN) RFLAGS: 0000000000000206   EM: 1   CONTEXT: pv guest
(XEN) rax: ffffffff81a7c000   rbx: 000000000000003d   rcx: 0000000000001000
(XEN) rdx: ffffffff81a7b000   rsi: 0000000000001000   rdi: 0000000000001000
(XEN) rbp: ffffffff81801cd8   rsp: ffffffff81801c98   r8:  0000000000100000
(XEN) r9:  ffffffff81a7a000   r10: 0000000000000001   r11: 0000000000000003
(XEN) r12: 0000000000000004   r13: 0000000000000004   r14: 000000000000003d
(XEN) r15: 00000000000001e8   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000125803000   cr2: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81801c98:

.. which is extend_brk hitting a BUG_ON.

Note that git commit c3d93f880197953f86ab90d9da4744e926b38e33
(xen: populate correct number of pages when across mem boundary (v2))
exposed this bug.

Interestingly enough, most of the time we are not going to hit this
b/c the _brk space is quite large (v3.5):
ffffffff81a25000 B __brk_base
ffffffff81e43000 B __brk_limit
= ~4MB.

vs earlier kernels (with this back-ported), the space is smaller:
ffffffff81a25000 B __brk_base
ffffffff81a7b000 B __brk_limit
= 344 kBytes.

With this patch, we would get now a limited amount of pages populated back:
Freeing 9f-100 pfn range: 97 pages freed
Freeing b7ee0-ecd9b pfn range: 216763 pages freed
Released 216860 pages of unused memory
Set 295297 page(s) to 1-1 mapping
Populating 100000-134f1c pfn range: 30720 pages added

[while it was instructed to populate 216860 pages back
on this particular machine]
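
The sizes and page counts quoted above can be re-derived with a short
Python sketch. The figure of 512 entries per P2M leaf page is an
assumption (4 KiB pages, 8-byte entries); everything else is taken
from the numbers in this message.

```python
def brk_size(base, limit):
    """Size in bytes of the _brk area between two symbol addresses."""
    return limit - base

v35 = brk_size(0xffffffff81a25000, 0xffffffff81e43000)     # v3.5 figures
older = brk_size(0xffffffff81a25000, 0xffffffff81a7b000)   # earlier kernels

print(f"v3.5:  {v35 / (1 << 20):.2f} MB")
print(f"older: {older // 1024} kBytes")

ENTRIES_PER_LEAF = 512              # assumption: 4 KiB page / 8-byte entry

released = 97 + 216763              # from the "Freeing ..." lines
requested = 0x134f1c - 0x100000     # size of the "Populating" pfn range
populated = 30720                   # pages actually added

leaf_pages_used = populated // ENTRIES_PER_LEAF
print(f"{populated} of {requested} pages populated, "
      f"{leaf_pages_used} leaf pages' worth")
```

Under the stated assumption the populated count is an exact multiple
of the leaf size, which is consistent with population stopping once no
further leaf page could be allocated.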

[upstream git commit 6fc0f0142ecf25e3a7e1db52033586107f829af0]

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen PVonHVM: move shared_info to MMIO before kexec

Currently kexec in a PVonHVM guest fails with a triple fault because the
new kernel overwrites the shared info page. The exact failure depends on
the size of the kernel image. This patch moves the pfn from RAM into
MMIO space before the kexec boot.

The pfn containing the shared_info is located somewhere in RAM. This
will cause trouble if the current kernel is doing a kexec boot into a
new kernel. The new kernel (and its startup code) can not know where the
pfn is, so it can not reserve the page. The hypervisor will continue to
update the pfn, and as a result memory corruption occurs in the new
kernel.

One way to work around this issue is to allocate a page in the
xen-platform pci device's BAR memory range. But pci init is done very
late and the shared_info page is already in use very early to read the
pvclock. So moving the pfn from RAM to MMIO is racy because some code
paths on other vcpus could access the pfn during the small window when
the old pfn is moved to the new pfn. There is even a small window where
the old pfn is not backed by a mfn, and during that time all reads
return -1.

Because it is not known upfront where the MMIO region is located it can
not be used right from the start in xen_hvm_init_shared_info.

To minimise trouble the move of the pfn is done shortly before kexec.
This does not eliminate the race because all vcpus are still online when
the syscore_ops will be called. But hopefully there is no work pending
at this point in time. Also the syscore_op is run last which reduces the
risk further.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: simplify init_hvm_pv_info

init_hvm_pv_info is called only in PVonHVM context, so move it into the
ifdef. init_hvm_pv_info does not fail, so make it a void function.
Remove the arguments from init_hvm_pv_info because they are not used by
the caller.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: remove cast from HYPERVISOR_shared_info assignment

Both have type struct shared_info so no cast is needed.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: enable platform-pci only in a Xen guest

While debugging kexec issues in a PVonHVM guest I modified
xen_hvm_platform() to return false to disable all PV drivers. This
caused a crash in platform_pci_init() because it expects certain data
structures to be initialized properly.

To avoid such a crash make sure the driver is initialized only if
running in a Xen guest.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pv-on-hvm kexec: shutdown watches from old kernel

Add xs_reset_watches function to shut down watches from the old kernel
after kexec boot.  The old kernel does not unregister all watches in the
shutdown path.  They are still active, and the double registration cannot
be detected by the new kernel.  When the watches fire, unexpected events
will arrive and the xenwatch thread will crash (jumps to NULL).  An
orderly reboot of a hvm guest will destroy the entire guest with all its
resources (including the watches) before it is rebuilt from scratch, so
the missing unregister is not an issue in that case.

With this change the xenstored is instructed to wipe all active watches
for the guest.  However, a patch for xenstored is required so that it
accepts the XS_RESET_WATCHES request from a client (see changeset
23839:42a45baf037d in xen-unstable.hg). Without the patch for xenstored
the registration of watches will fail and some features of a PVonHVM
guest are not available. The guest is still able to boot, but repeated
kexec boots will fail.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"

This reverts commit ddacf5ef684a655abe2bb50c4b2a5b72ae0d5e05.

xen/hvc: Fix up checks when the info is allocated.

Coverity would complain about this - even though it looks OK.

CID 401957
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:

drivers/tty/hvc/hvc_xen.c

xen/mm: zero PTEs for non-present MFNs in the initial page table

When constructing the initial page tables, if the MFN for a usable PFN
is missing in the p2m then that frame is initially ballooned out. In
this case, zero the PTE (as in decrease_reservation() in
drivers/xen/balloon.c).

This is obviously safer than having a valid PTE with an MFN of
INVALID_P2M_ENTRY (~0).
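
The rule can be sketched in Python as follows. This is a toy model:
the function name, the dictionary standing in for the p2m and the
protection bits are hypothetical, and 4 KiB pages are assumed.

```python
PAGE_SHIFT = 12                        # assumption: 4 KiB pages
INVALID_P2M_ENTRY = (1 << 64) - 1      # ~0 as a 64-bit value

def make_initial_pte(pfn, p2m, prot):
    """Return the PTE for pfn, or 0 if the frame is ballooned out."""
    mfn = p2m.get(pfn, INVALID_P2M_ENTRY)
    if mfn == INVALID_P2M_ENTRY:
        return 0                       # zeroed, non-present PTE
    return (mfn << PAGE_SHIFT) | prot

p2m = {0x14e26: 0x116a80}              # hypothetical PFN -> MFN table

present = make_initial_pte(0x14e26, p2m, prot=0x067)
missing = make_initial_pte(0x14e27, p2m, prot=0x067)
```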

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable

In xen_set_pte() if batching is unavailable (because the caller is in
an interrupt context such as handling a page fault) it would fall back
to using native_set_pte() and trapping and emulating the PTE write.

On 32-bit guests this requires two traps for each PTE write (one for
each dword of the PTE).  Instead, do one mmu_update hypercall
directly.
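
The difference in cost can be sketched with a toy Python model. It
assumes a 32-bit guest with 64-bit PTEs, so that a plain store needs
two 32-bit writes; the function names and the event tuples are made up
for illustration and do not correspond to a real interface.

```python
def split_pte(pte):
    """Split a 64-bit PTE into (low, high) 32-bit dwords."""
    return pte & 0xffffffff, pte >> 32

def emulated_writes(pte):
    """Trap-and-emulate path: one trapped store per dword."""
    low, high = split_pte(pte)
    return [("trap", low), ("trap", high)]

def hypercall_write(pte):
    """Direct path: a single mmu_update request carrying the whole PTE."""
    return [("mmu_update", pte)]

pte = 0x8000000116a83067

print(len(emulated_writes(pte)), "traps versus",
      len(hypercall_write(pte)), "hypercall")
```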

During construction of the initial page tables, continue to use
native_set_pte() because most of the PTEs being set are in writable
and unpinned pages (see phys_pmd_init() in arch/x86/mm/init_64.c) and
using a hypercall for this is very expensive.

This significantly improves page fault performance in 32-bit PV
guests.

lmbench3 test  Before    After     Improvement
----------------------------------------------
lat_pagefault  3.18 us   2.32 us   27%
lat_proc fork  356 us    313.3 us  11%

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>