Initialize the grant table mapping at the address specified at index 0
in the DT under the /xen node.
After the grant table is initialized, call xenbus_probe (if not dom0).
Changes in v2:
- introduce GRANT_TABLE_PHYSADDR;
- remove unneeded initialization of boot_max_nr_grant_frames.
Jan Beulich [Fri, 3 Feb 2012 15:09:04 +0000 (15:09 +0000)]
xen/tmem: cleanup
Use 'bool' for boolean variables. Do proper section placement.
Eliminate an unnecessary export.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 8e6f7c23c135b13f3adf90906fac7edd325bb9af)
Currently, the memory target in the Xen selfballooning driver is mainly
driven by the value of "Committed_AS". However, there are cases in
which it is desirable to assign additional memory to be available for
the kernel, e.g. for local caches (which are not covered by cleancache),
e.g. dcache and inode caches.
This adds an additional tunable in the selfballooning driver (accessible
via sysfs) which allows the user to specify an additional constant
amount of memory to be reserved by the selfballoning driver for the
local domain.
Signed-off-by: Jana Saout <jana@saout.de> Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit d79d5959a023fd637e90ed1ff6547ff09d19396b)
Conflicts:
drivers/xen/xen-selfballoon.c
[UEK2 hasn't done the 'convert sysdev_class to a regular subsystem'
patchset]
Jan Beulich [Wed, 14 Mar 2012 16:34:19 +0000 (12:34 -0400)]
xen: constify all instances of "struct attribute_group"
The functions these get passed to have been taking pointers to const
since at least 2.6.16.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit ead1d01425bbd28c4354b539caa4075bde00ed72)
Dan Magenheimer [Tue, 27 Sep 2011 14:47:58 +0000 (08:47 -0600)]
xen: Fix selfballooning and ensure it doesn't go too far
The balloon driver's "current_pages" is very different from
totalram_pages. Self-ballooning needs to be driven by
the latter. Also, Committed_AS doesn't account for pages
used by the kernel so:
1) Add totalreserve_pages to Committed_AS for the normal target.
2) Enforce a floor for when there are little or no user-space threads
using memory (e.g. single-user mode) to avoid OOMs. The floor
function includes a "min_usable_mb" tuneable in case we discover
later that the floor function is still too aggressive in some
workloads, though likely it will not be needed.
Changes since version 4:
- change floor calculation so that it is not as aggressive; this version
uses a piecewise linear function similar to minimum_target in the 2.6.18
balloon driver, but modified to add to totalreserve_pages instead of
subtract from max_pfn, the 2.6.18 version causes OOMs on recent kernels
because the kernel has expanded over time
- change safety_margin to min_usable_mb and comment on its use
- since committed_as does NOT include kernel space (and other reserved
pages), totalreserve_pages is now added to committed_as. The result is
less aggressive self-ballooning, but theoretically more appropriate.
Changes since version 3:
- missing include causes compile problem when CONFIG_FRONTSWAP is disabled
- add comments after includes
Changes since version 2:
- missing include causes compile problem only on 32-bit
Changes since version 1:
- tuneable safety margin added
[v5: avi.miller@oracle.com: still too aggressive, seeing some OOMs]
[v4: konrad.wilk@oracle.com: fix compile when CONFIG_FRONTSWAP is disabled]
[v3: guru.anbalagane@oracle.com: fix 32-bit compile]
[v2: konrad.wilk@oracle.com: make safety margin tuneable] Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Altered description and added an extra include] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 38a1ed4f039db32b418007ac365076cf53647ebd)
Randy Dunlap [Tue, 16 Aug 2011 04:41:43 +0000 (21:41 -0700)]
xen: self-balloon needs module.h
Fix build errors (found when CONFIG_SYSFS is not enabled):
drivers/xen/xen-selfballoon.c:446: warning: data definition has no type or storage class
drivers/xen/xen-selfballoon.c:446: warning: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
drivers/xen/xen-selfballoon.c:446: warning: parameter names (without types) in function declaration
drivers/xen/xen-selfballoon.c:485: error: expected declaration specifiers or '...' before string constant
drivers/xen/xen-selfballoon.c:485: warning: data definition has no type or storage class
drivers/xen/xen-selfballoon.c:485: warning: type defaults to 'int' in declaration of 'MODULE_LICENSE'
drivers/xen/xen-selfballoon.c:485: warning: function declaration isn't a prototype
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4fec0e0bde09095b6349dc6206dbf19cebcd0a7e)
With a specific enough .config file compile errors show
for missing workqueue declarations.
Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 0642d2edc858a1f08716bb32e1ab890db8dac246)
Dan Magenheimer [Fri, 8 Jul 2011 18:26:21 +0000 (12:26 -0600)]
xen: tmem: self-ballooning and frontswap-selfshrinking
This patch introduces two in-kernel drivers for Xen transcendent memory
("tmem") functionality that complement cleancache and frontswap. Both
use control theory to dynamically adjust and optimize memory utilization.
Selfballooning controls the in-kernel Xen balloon driver, targeting a goal
value (vm_committed_as), thus pushing less frequently used clean
page cache pages (through the cleancache code) into Xen tmem where
Xen can balance needs across all VMs residing on the physical machine.
Frontswap-selfshrinking controls the number of pages in frontswap,
driving it towards zero (effectively doing a partial swapoff) when
in-kernel memory pressure subsides, freeing up RAM for other VMs.
More detail is provided in the header comment of xen-selfballooning.c.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v8: konrad.wilk@oracle.com: set default enablement depending on frontswap]
[v7: konrad.wilk@oracle.com: fix capitalization and punctuation in comments]
[v6: fix frontswap-selfshrinking initialization]
[v6: konrad.wilk@oracle.com: fix init pr_infos; add comments about swap]
[v5: konrad.wilk@oracle.com: add NULL to attr list; move inits up to decls]
[v4: dkiper@net-space.pl: use strict_strtoul plus a few syntactic nits]
[v3: konrad.wilk@oracle.com: fix potential divides-by-zero]
[v3: konrad.wilk@oracle.com: add many more comments, fix nits]
[v2: rebased to linux-3.0-rc1]
[v2: Ian.Campbell@citrix.com: reorganize as new file (xen-selfballoon.c)]
[v2: dkiper@net-space.pl: proper access to vm_committed_as]
[v2: dkiper@net-space.pl: accounting fixes] Cc: Jan Beulich <JBeulich@novell.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: <xen-devel@lists.xensource.com>
(cherry picked from commit a50777c791031d7345ce95785ea6220f67339d90)
Ian Campbell [Wed, 17 Oct 2012 08:39:10 +0000 (09:39 +0100)]
xen: sysfs: fix build warning.
Define PRI macros for xen_ulong_t and xen_pfn_t and use to fix:
drivers/xen/sys-hypervisor.c:288:4: warning: format '%lx' expects argument of type 'long unsigned int', but argument 3 has type 'xen_ulong_t' [-Wformat]
Ideally this would use PRIx64 on ARM but these (or equivalent) don't
seem to be available in the kernel.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
All the original Xen headers have xen_ulong_t as unsigned long type, however
when they have been imported in Linux, xen_ulong_t has been replaced with
unsigned long. That might work for x86 and ia64 but it does not for arm.
Bring back xen_ulong_t and let each architecture define xen_ulong_t as they
see fit.
Also explicitly size pointers (__DEFINE_GUEST_HANDLE) to 64 bit.
Changes in v3:
- remove the incorrect changes to multicall_entry;
- remove the change to apic_physbase.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Stefano Stabellini [Wed, 22 Aug 2012 16:20:14 +0000 (17:20 +0100)]
xen: Introduce xen_pfn_t for pfn and mfn types
All the original Xen headers have xen_pfn_t as mfn and pfn type, however
when they have been imported in Linux, xen_pfn_t has been replaced with
unsigned long. That might work for x86 and ia64 but it does not for arm.
Bring back xen_pfn_t and let each architecture define xen_pfn_t as they
see fit.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Add a doc to describe the Xen ARM device tree bindings
Changes in v5:
- add a comment about the size of the grant table memory region;
- add a comment about the required presence of a GIC node;
- specify that the described properties are part of a top-level
"hypervisor" node;
- specify #address-cells and #size-cells for the example.
Changes in v4:
- "xen,xen" should be last as it is less specific;
- update reg property using 2 address-cells and 2 size-cells.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Rob Herring <rob.herring@calxeda.com> CC: devicetree-discuss@lists.ozlabs.org CC: David Vrabel <david.vrabel@citrix.com> CC: Rob Herring <robherring2@gmail.com> CC: Dave Martin <dave.martin@linaro.org>
(cherry picked from commit c43cdfbc4cebdf1a7992432615bf5155b51b8cc0)
Stefano Stabellini [Wed, 8 Aug 2012 16:34:01 +0000 (16:34 +0000)]
xen/arm: sync_bitops
sync_bitops functions are equivalent to the SMP implementation of the
original functions, independently from CONFIG_SMP being defined.
We need them because _set_bit etc are not SMP safe if !CONFIG_SMP. But
under Xen you might be communicating with a completely external entity
who might be on another CPU (e.g. two uniprocessor guests communicating
via event channels and grant tables). So we need a variant of the bit
ops which are SMP safe even on a UP kernel.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit e54d2f61528165bbe0e5f628b111eab3be31c3b5)
Use r12 to pass the hypercall number to the hypervisor.
We need a register to pass the hypercall number because we might not
know it at compile time and HVC only takes an immediate argument.
Among the available registers r12 seems to be the best choice because it
is defined as "intra-procedure call scratch register".
Use the ISS to pass an hypervisor specific tag.
Changes in v2:
- define an HYPERCALL macro for 5 arguments hypercall wrappers, even if
at the moment is unused;
- use ldm instead of pop;
- fix up comments.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit aa2466d21bd9e872690693d56feb946781443f28)
- Basic hypervisor.h and interface.h definitions.
- Skeleton enlighten.c, set xen_start_info to an empty struct.
- Make xen_initial_domain dependent on the SIF_PRIVILIGED_BIT.
The new code only compiles when CONFIG_XEN is set, that is going to be
added to arch/arm/Kconfig in patch #11 "xen/arm: introduce CONFIG_XEN on
ARM".
Changes in v3:
- improve comments.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4c071ee5268f7234c3d084b6093bebccc28cdcba)
Jan Beulich [Thu, 9 Feb 2012 03:33:51 +0000 (11:33 +0800)]
xen/vga: add the xen EFI video mode support
In order to add xen EFI frambebuffer video support, it is required to add
xen-efi's new video type (XEN_VGATYPE_EFI_LFB) case and handle it in the
function xen_init_vga and set the video type to VIDEO_TYPE_EFI to enable
efi video mode.
The original patch from which this was broken out from:
http://marc.info/?i=4E099AA6020000780004A4C6@nat28.tlf.novell.com
Signed-off-by: Jan Beulich <JBeulich@novell.com> Signed-off-by: Tang Liang <liang.tang@oracle.com>
[v2: The original author is Jan Beulich and Liang Tang ported it to upstream] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit aa387d630cfed1a694a9c8c61fba3877ba8d4f07)
Jeremy Fitzhardinge [Tue, 31 May 2011 14:50:10 +0000 (10:50 -0400)]
xen: allow enable use of VGA console on dom0
Get the information about the VGA console hardware from Xen, and put
it into the form the bootloader normally generates, so that the rest
of the kernel can deal with VGA as usual.
[ Impact: make VGA console work in dom0 ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Rebased on 2.6.39]
[v2: Removed incorrect comments and fixed compile warnings] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit c2419b4a4727f67af2fc2cd68b0d878b75e781bb)
xen/pcifront: Use Xen-SWIOTLB when initting if required.
We piggyback on "xen/swiotlb: Use the swiotlb_late_init_with_tbl to init
Xen-SWIOTLB late when PV PCI is used." functionality to start up
the Xen-SWIOTLB if we are hot-plugged. This allows us to bypass
the need to supply 'iommu=soft' on the Linux command line (mostly).
With this patch, if a user forgot 'iommu=soft' on the command line,
and hotplug a PCI device they will get:
xen/swiotlb: For early initialization, return zero on success.
If everything is setup properly we would return -ENOMEM since
rc by default is set to that value. Lets not do that and return
a proper return code.
Note: The reason the early code needs this special treatment
is that it SWIOTLB library call does not return anything (and
had it failed it would call panic()) - but our function does.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit c468bdee28a1cb61d9b7a8ce9859d17dee43b7d7)
Konrad Rzeszutek Wilk [Thu, 23 Aug 2012 18:36:15 +0000 (14:36 -0400)]
xen/swiotlb: Use the swiotlb_late_init_with_tbl to init Xen-SWIOTLB late when PV PCI is used.
With this patch we provide the functionality to initialize the
Xen-SWIOTLB late in the bootup cycle - specifically for
Xen PCI-frontend. We still will work if the user had
supplied 'iommu=soft' on the Linux command line.
Note: We cannot depend on after_bootmem to automatically
determine whether this is early or not. This is because
when PCI IOMMUs are initialized it is after after_bootmem but
before a lot of "other" subsystems are initialized.
CC: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
[v1: Fix smatch warnings]
[v2: Added check for xen_swiotlb]
[v3: Rebased with new xen-swiotlb changes]
[v4: squashed xen/swiotlb: Depending on after_bootmem is not correct in] Reviewed-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b82776005369899c1c7ca2e4b2414bb64b538d2c)
Ronny Hegewald [Fri, 31 Aug 2012 09:57:52 +0000 (09:57 +0000)]
xen: Use correct masking in xen_swiotlb_alloc_coherent.
When running 32-bit pvops-dom0 and a driver tries to allocate a coherent
DMA-memory the xen swiotlb-implementation returned memory beyond 4GB.
The underlaying reason is that if the supplied driver passes in a
DMA_BIT_MASK(64) ( hwdev->coherent_dma_mask is set to 0xffffffffffffffff)
our dma_mask will be u64 set to 0xffffffffffffffff even if we set it to
DMA_BIT_MASK(32) previously. Meaning we do not reset the upper bits.
By using the dma_alloc_coherent_mask function - it does the proper casting
and we get 0xfffffffff.
This caused not working sound on a system with 4 GB and a 64-bit
compatible sound-card with sets the DMA-mask to 64bit.
On bare-metal and the forward-ported xen-dom0 patches from OpenSuse a coherent
DMA-memory is always allocated inside the 32-bit address-range by calling
dma_alloc_coherent_mask.
This patch adds the same functionality to xen swiotlb and is a rebase of the
original patch from Ronny Hegewald which never got upstream b/c the
underlaying reason was not understood until now.
The original email with the original patch is in:
http://old-list-archives.xen.org/archives/html/xen-devel/2010-02/msg00038.html
the original thread from where the discussion started is in:
http://old-list-archives.xen.org/archives/html/xen-devel/2010-01/msg00928.html
Signed-off-by: Ronny Hegewald <ronny.hegewald@online.de> Signed-off-by: Stefano Panella <stefano.panella@citrix.com> Acked-By: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> CC: stable@vger.kernel.org
(cherry picked from commit b5031ed1be0aa419250557123633453753181643)
Konrad Rzeszutek Wilk [Thu, 15 Dec 2011 16:28:46 +0000 (11:28 -0500)]
xen/swiotlb: Use page alignment for early buffer allocation.
This fixes an odd bug found on a Dell PowerEdge 1850/0RC130
(BIOS A05 01/09/2006) where all of the modules doing pci_set_dma_mask
would fail with:
ata_piix 0000:00:1f.1: enabling device (0005 -> 0007)
ata_piix 0000:00:1f.1: can't derive routing for PCI INT A
ata_piix 0000:00:1f.1: BMDMA: failed to set dma mask, falling back to PIO
The issue was the Xen-SWIOTLB was allocated such as that the end of
buffer was stradling a page (and also above 4GB). The fix was
spotted by Kalev Leonid which was to piggyback on git commit e79f86b2ef9c0a8c47225217c1018b7d3d90101c "swiotlb: Use page alignment
for early buffer allocation" which:
We could call free_bootmem_late() if swiotlb is not used, and
it will shrink to page alignment.
So alloc them with page alignment at first, to avoid lose two pages
Konrad Rzeszutek Wilk [Thu, 25 Aug 2011 20:13:54 +0000 (16:13 -0400)]
xen-swiotlb: When doing coherent alloc/dealloc check before swizzling the MFNs.
The process to swizzle a Machine Frame Number (MFN) is not always
necessary. Especially if we know that we actually do not have to do it.
In this patch we check the MFN against the device's coherent
DMA mask and if the requested page(s) are contingous. If it all checks
out we will just return the bus addr without doing the memory
swizzle.
Randy Dunlap [Thu, 11 Aug 2011 20:57:07 +0000 (13:57 -0700)]
xen-swiotlb: fix printk and panic args
Fix printk() and panic() args [swap them] to fix build warnings:
drivers/xen/swiotlb-xen.c:201: warning: format '%s' expects type 'char *', but argument 2 has type 'int'
drivers/xen/swiotlb-xen.c:201: warning: format '%d' expects type 'int', but argument 3 has type 'char *'
drivers/xen/swiotlb-xen.c:202: warning: format '%s' expects type 'char *', but argument 2 has type 'int'
drivers/xen/swiotlb-xen.c:202: warning: format '%d' expects type 'int', but argument 3 has type 'char *'
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 61ca79831ce52c23b3a130f3c2351751e00e0ac9)
Propagate the baremetal git commit "swiotlb: fix wrong panic"
(fba99fa38b023224680308a482e12a0eca87e4e1) in the Xen-SWIOTLB version.
wherein swiotlb's map_page wrongly calls panic() when it can't find
a buffer fit for device's dma mask. It should return an error instead.
Devices with an odd dma mask (i.e. under 4G) like b44 network card hit
this bug (the system crashes):
xen-swiotlb: Retry up three times to allocate Xen-SWIOTLB
We can fail seting up Xen-SWIOTLB if:
- The host does not have enough contiguous DMA32 memory available
(can happen on a machine that has fragmented memory from starting,
stopping many guests).
- Not enough low memory (almost never happens).
We retry allocating and exchanging the swath of contiguous memory
up to three times. Each time we decrease the amount we need - the
minimum being of 2MB.
If we compleltly fail, we will print the reason for failure on the Xen
console on top of doing it to earlyprintk=xen console.
swiotlb: add the late swiotlb initialization function with iotlb memory
This enables the caller to initialize swiotlb with its own iotlb
memory late in the bootup.
See git commit eb605a5754d050a25a9f00d718fb173f24c486ef
"swiotlb: add swiotlb_tbl_map_single library function" which will
explain the full details of what it can be used for.
CC: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
[v1: Fold in smatch warning] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 74838b75379a53678ffc5f59de86161d21e2c808)
xen/swiotlb: With more than 4GB on 64-bit, disable the native SWIOTLB.
If a PV guest is booted the native SWIOTLB should not be
turned on. It does not help us (we don't have any PCI devices)
and it eats 64MB of good memory. In the case of PV guests
with PCI devices we need the Xen-SWIOTLB one.
[v1: Rewrite comment per Stefano's suggestion] Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit fc2341df9e31be8a3940f4e302372d7ef46bab8c)
Its pretty easy:
1). We only check to see if we need Xen SWIOTLB for PV guests.
2). If swiotlb=force or iommu=soft is set, then Xen SWIOTLB will
be enabled.
3). If it is an initial domain, then Xen SWIOTLB will be enabled.
4). Native SWIOTLB must be disabled for PV guests.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 988f0e24bbcbbf550dff016faf8273a94f4eb1af)
Andres Lagar-Cavilla [Fri, 14 Sep 2012 14:26:59 +0000 (14:26 +0000)]
xen/gndev: Xen backend support for paged out grant targets V4.
Since Xen-4.2, hvm domains may have portions of their memory paged out. When a
foreign domain (such as dom0) attempts to map these frames, the map will
initially fail. The hypervisor returns a suitable errno, and kicks an
asynchronous page-in operation carried out by a helper. The foreign domain is
expected to retry the mapping operation until it eventually succeeds. The
foreign domain is not put to sleep because itself could be the one running the
pager assist (typical scenario for dom0).
This patch adds support for this mechanism for backend drivers using grant
mapping and copying operations. Specifically, this covers the blkback and
gntdev drivers (which map foreign grants), and the netback driver (which copies
foreign grants).
* Add a retry method for grants that fail with GNTST_eagain (i.e. because the
target foreign frame is paged out).
* Insert hooks with appropriate wrappers in the aforementioned drivers.
The retry loop is only invoked if the grant operation status is GNTST_eagain.
It guarantees to leave a new status code different from GNTST_eagain. Any other
status code results in identical code execution as before.
The retry loop performs 256 attempts with increasing time intervals through a
32 second period. It uses msleep to yield while waiting for the next retry.
V2 after feedback from David Vrabel:
* Explicit MAX_DELAY instead of wrap-around delay into zero
* Abstract GNTST_eagain check into core grant table code for netback module.
V3 after feedback from Ian Campbell:
* Add placeholder in array of grant table error descriptions for unrelated
error code we jump over.
* Eliminate single map and retry macro in favor of a generic batch flavor.
* Some renaming.
* Bury most implementation in grant_table.c, cleaner interface.
V4 rebased on top of sync of Xen grant table interface headers.
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[v5: Fixed whitespace issues] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit c571898ffc24a1768e1b2dabeac0fc7dd4c14601)
Stefano Stabellini [Wed, 22 Aug 2012 16:20:15 +0000 (17:20 +0100)]
xen: clear IRQ_NOAUTOEN and IRQ_NOREQUEST
Reset the IRQ_NOAUTOEN and IRQ_NOREQUEST flags that are enabled by
default on ARM. If IRQ_NOAUTOEN is set, __setup_irq doesn't call
irq_startup, that is responsible for calling irq_unmask at startup time.
As a result event channels remain masked.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit a8636c0b2e57d4f31f71aa306b1ee701db3f3c85)
Stefano Stabellini [Wed, 22 Aug 2012 16:20:11 +0000 (17:20 +0100)]
xen/events: fix unmask_evtchn for PV on HVM guests
When unmask_evtchn is called, if we already have an event pending, we
just set evtchn_pending_sel waiting for local_irq_enable to be called.
That is because PV guests set the irq_enable pvops to
xen_irq_enable_direct in xen_setup_vcpu_info_placement:
xen_irq_enable_direct is implemented in assembly in
arch/x86/xen/xen-asm.S and call xen_force_evtchn_callback if
XEN_vcpu_info_pending is set.
However HVM guests (and ARM guests) do not change or do not have the
irq_enable pvop, so evtchn_unmask cannot work properly for them.
Considering that having the pending_irq bit set when unmask_evtchn is
called is not very common, and it is simpler to keep the
native_irq_enable implementation for HVM guests (and ARM guests), the
best thing to do is just use the EVTCHNOP_unmask hypercall (Xen
re-injects pending events in response).
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b5e579232d635b79a3da052964cb357ccda8d9ea)
Mats Petersson [Fri, 16 Nov 2012 18:36:49 +0000 (18:36 +0000)]
xen/privcmd: Correctly return success from IOCTL_PRIVCMD_MMAPBATCH
This is a regression introduced by ceb90fa0 (xen/privcmd: add
PRIVCMD_MMAPBATCH_V2 ioctl). It broke xentrace as it used
xc_map_foreign() instead of xc_map_foreign_bulk().
Most code-paths prefer the MMAPBATCH_V2, so this wasn't very obvious
that it broke. The return value is set early on to -EINVAL, and if all
goes well, the "set top bits of the MFN's" never gets called, so the
return value is still EINVAL when the function gets to the end, causing
the caller to think it went wrong (which it didn't!)
Now also including Andres "move the ret = -EINVAL into the error handling
path, as this avoids other similar errors in future.
Signed-off-by: Mats Petersson <mats.petersson@citrix.com> Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> Acked-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 68fa965dd923177eafad49b7a0045fc610917341)
Konrad Rzeszutek Wilk [Wed, 31 Oct 2012 16:38:31 +0000 (12:38 -0400)]
xen/mmu: Use Xen specific TLB flush instead of the generic one.
As Mukesh explained it, the MMUEXT_TLB_FLUSH_ALL allows the
hypervisor to do a TLB flush on all active vCPUs. If instead
we were using the generic one (which ends up being xen_flush_tlb)
we end up making the MMUEXT_TLB_FLUSH_LOCAL hypercall. But
before we make that hypercall the kernel will IPI all of the
vCPUs (even those that were asleep from the hypervisor
perspective). The end result is that we needlessly wake them
up and do a TLB flush when we can just let the hypervisor
do it correctly.
This patch gives around 50% speed improvement when migrating
idle guest's from one host to another.
xen/enlighten: Disable MWAIT_LEAF so that acpi-pad won't be loaded.
There are exactly four users of __monitor and __mwait:
- cstate.c (which allows acpi_processor_ffh_cstate_enter to be called
when the cpuidle API drivers are used. However patch
"cpuidle: replace xen access to x86 pm_idle and default_idle"
provides a mechanism to disable the cpuidle and use safe_halt.
- smpboot (which allows mwait_play_dead to be called). However
safe_halt is always used so we skip that.
- intel_idle (same deal as above).
- acpi_pad.c. This the one that we do not want to run as we
will hit the below crash.
Why do we want to expose MWAIT_LEAF in the first place?
We want it for the xen-acpi-processor driver - which uploads
C-states to the hypervisor. If MWAIT_LEAF is set, the cstate.c
sets the proper address in the C-states so that the hypervisor
can benefit from using the MWAIT functionality. And that is
the sole reason for using it.
Without this patch, if a module performs mwait or monitor we
get this:
Stub out MSR methods that aren't actually needed. This fixes a crash
as Xen Dom0 on AMD Trinity systems. A bigger patch should be added to
remove the paravirt machinery completely for the methods which
apparently have no users!
Andre Przywara [Tue, 29 May 2012 11:07:31 +0000 (13:07 +0200)]
xen/setup: filter APERFMPERF cpuid feature out
Xen PV kernels allow access to the APERF/MPERF registers to read the
effective frequency. Access to the MSRs is however redirected to the
currently scheduled physical CPU, making consecutive read and
compares unreliable. In addition each rdmsr traps into the hypervisor.
So to avoid bogus readouts and expensive traps, disable the kernel
internal feature flag for APERF/MPERF if running under Xen.
This will
a) remove the aperfmperf flag from /proc/cpuinfo
b) not mislead the power scheduler (arch/x86/kernel/cpu/sched.c) to
use the feature to improve scheduling (by default disabled)
c) not mislead the cpufreq driver to use the MSRs
This does not cover userland programs which access the MSRs via the
device file interface, but this will be addressed separately.
Signed-off-by: Andre Przywara <andre.przywara@amd.com> Cc: stable@vger.kernel.org # v3.0+ Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 5e626254206a709c6e937f3dda69bf26c7344f6f)
Konrad Rzeszutek Wilk [Tue, 14 Feb 2012 03:26:32 +0000 (22:26 -0500)]
xen/enlighten: Expose MWAIT and MWAIT_LEAF if hypervisor OKs it.
For the hypervisor to take advantage of the MWAIT support it needs
to extract from the ACPI _CST the register address. But the
hypervisor does not have the support to parse DSDT so it relies on
the initial domain (dom0) to parse the ACPI Power Management information
and push it up to the hypervisor. The pushing of the data is done
by the processor_harveset_xen module which parses the information that
the ACPI parser has graciously exposed in 'struct acpi_processor'.
For the ACPI parser to also expose the Cx states for MWAIT, we need
to expose the MWAIT capability (leaf 1). Furthermore we also need to
expose the MWAIT_LEAF capability (leaf 5) for cstate.c to properly
function.
The hypervisor could expose these flags when it traps the XEN_EMULATE_PREFIX
operations, but it can't do it since it needs to be backwards compatible.
Instead we choose to use the native CPUID to figure out if the MWAIT
capability exists and use the XEN_SET_PDC query hypercall to figure out
if the hypervisor wants us to expose the MWAIT_LEAF capability or not.
Note: The XEN_SET_PDC query was implemented in c/s 23783:
"ACPI: add _PDC input override mechanism".
With this in place, instead of
C3 ACPI IOPORT 415
we get now
C3:ACPI FFH INTEL MWAIT 0x20
Note: The cpu_idle which would be calling the mwait variants for idling
never gets set b/c we set the default pm_idle to be the hypercall variant.
Acked-by: Jan Beulich <JBeulich@suse.com>
[v2: Fix missing header file include and #ifdef] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 73c154c60be106b47f15d1111fc2d75cc7a436f2)
This file depends on <xen/xen.h>, but the dependency was hidden due
to: <asm/acpi.h> -> <asm/trampoline.h> -> <asm/io.h> -> <xen/xen.h>
With the removal of <asm/trampoline.h>, this exposed the missing
Reported-by: Ingo Molnar <mingo@kernel.org> Cc: Len Brown <lenb@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Jarkko Sakkinen <jarkko.sakkinen@intel.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
(cherry picked from commit 323f90a60864f30fd6b7c99806584bb90ada1a29)
We did a similar check for the P-states but did not do it for
the C-states. What we want to do is ignore cases where the DSDT
has definition for sixteen CPUs, but the machine only has eight
CPUs and we get:
xen-acpi-processor: (CX): Hypervisor error (-22) for ACPI CPU14
Reported-by: Tobias Geiger <tobias.geiger@vido.info> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b930fe5e1f5646e071facda70b25b137ebeae5af)
Konrad Rzeszutek Wilk [Fri, 3 Feb 2012 21:03:20 +0000 (16:03 -0500)]
xen/acpi-processor: C and P-state driver that uploads said data to hypervisor.
This driver solves three problems:
1). Parse and upload ACPI0007 (or PROCESSOR_TYPE) information to the
hypervisor - aka P-states (cpufreq data).
2). Upload the the Cx state information (cpuidle data).
3). Inhibit CPU frequency scaling drivers from loading.
The reason for wanting to solve 1) and 2) is such that the Xen hypervisor
is the only one that knows the CPU usage of different guests and can
make the proper decision of when to put CPUs and packages in proper states.
Unfortunately the hypervisor has no support to parse ACPI DSDT tables, hence it
needs help from the initial domain to provide this information. The reason
for 3) is that we do not want the initial domain to change P-states while the
hypervisor is doing it as well - it causes rather some funny cases of P-states
transitions.
For this to work, the driver parses the Power Management data and uploads said
information to the Xen hypervisor. It also calls acpi_processor_notify_smm()
to inhibit the other CPU frequency scaling drivers from being loaded.
Everything revolves around the 'struct acpi_processor' structure which
gets updated during the bootup cycle in different stages. At the startup, when
the ACPI parser starts, the C-state information is processed (processor_idle)
and saved in said structure as 'power' element. Later on, the CPU frequency
scaling driver (powernow-k8 or acpi_cpufreq), would call the the
acpi_processor_* (processor_perflib functions) to parse P-states information
and populate in the said structure the 'performance' element.
Since we do not want the CPU frequency scaling drivers from loading
we have to call the acpi_processor_* functions to parse the P-states and
call "acpi_processor_notify_smm" to stop them from loading.
There is also one oddity in this driver which is that under Xen, the
physical online CPU count can be different from the virtual online CPU count.
Meaning that the macros 'for_[online|possible]_cpu' would process only
up to virtual online CPU count. We on the other hand want to process
the full amount of physical CPUs. For that, the driver checks if the ACPI IDs
count is different from the APIC ID count - which can happen if the user
choose to use dom0_max_vcpu argument. In such a case a backup of the PM
structure is used and uploaded to the hypervisor.
[v1-v2: Initial RFC implementations that were posted]
[v3: Changed the name to passthru suggested by Pasi Kärkkäinen <pasik@iki.fi>]
[v4: Added vCPU != pCPU support - aka dom0_max_vcpus support]
[v5: Cleaned up the driver, fix bug under Athlon XP]
[v6: Changed the driver to a CPU frequency governor]
[v7: Jan Beulich <jbeulich@suse.com> suggestion to make it a cpufreq scaling driver
made me rework it as driver that inhibits cpufreq scaling driver]
[v8: Per Jan's review comments, fixed up the driver]
[v9: Allow to continue even if acpi_processor_preregister_perf.. fails] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 59a56802918100c1e39e68c30a2e5ae9f7d837f0)
Yu Ke [Wed, 24 Mar 2010 18:01:13 +0000 (11:01 -0700)]
xen/acpi: Domain0 acpi parser related platform hypercall
This patches implements the xen_platform_op hypercall, to pass the parsed
ACPI info to hypervisor.
Signed-off-by: Yu Ke <ke.yu@intel.com> Signed-off-by: Tian Kevin <kevin.tian@intel.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Added DEFINE_GUEST.. in appropiate headers]
[v2: Ripped out typedefs] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Campbell [Fri, 14 Sep 2012 07:19:01 +0000 (08:19 +0100)]
xen: resynchronise grant table status codes with upstream
Adds GNTST_address_too_big and GNTST_eagain.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit e58f5b55113b8fd4eb8eb43f5508d87e4862f280)
Dan Carpenter [Sat, 8 Sep 2012 09:57:35 +0000 (12:57 +0300)]
xen/privcmd: return -EFAULT on error
__copy_to_user() returns the number of bytes remaining to be copied but
we want to return a negative error code here.
Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 9d2be9287107695708e6aae5105a8a518a6cb4d0)
Andres Lagar-Cavilla [Thu, 6 Sep 2012 17:24:39 +0000 (13:24 -0400)]
xen/privcmd: Fix mmap batch ioctl error status copy back.
Copy back of per-slot error codes is only necessary for V2. V1 does not provide
an error array, so copyback will unconditionally set the global rc to EFAULT.
Only copyback for V2.
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 1714df7f2cee6a741c3ed24231ec5db25b90633a)
Andres Lagar-Cavilla [Fri, 31 Aug 2012 13:59:30 +0000 (09:59 -0400)]
xen/privcmd: add PRIVCMD_MMAPBATCH_V2 ioctl
PRIVCMD_MMAPBATCH_V2 extends PRIVCMD_MMAPBATCH with an additional
field for reporting the error code for every frame that could not be
mapped. libxc prefers PRIVCMD_MMAPBATCH_V2 over PRIVCMD_MMAPBATCH.
Also expand PRIVCMD_MMAPBATCH to return appropriate error-encoding top nibble
in the mfn array.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit ceb90fa0a8008059ecbbf9114cb89dc71a730bb6)
David Vrabel [Thu, 30 Aug 2012 12:58:11 +0000 (13:58 +0100)]
xen/mm: return more precise error from xen_remap_domain_range()
Callers of xen_remap_domain_range() need to know if the remap failed
because frame is currently paged out. So they can retry the remap
later on. Return -ENOENT in this case.
This assumes that the error codes returned by Xen are a subset of
those used by the kernel. It is unclear if this is defined as part of
the hypercall ABI.
Acked-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 69870a847856a1ba81f655a8633fce5f5b614730)
Konrad Rzeszutek Wilk [Mon, 13 Aug 2012 15:00:08 +0000 (11:00 -0400)]
xen/swiotlb: Fix compile warnings when using plain integer instead of NULL pointer.
arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer
arch/x86/xen/pci-swiotlb-xen.c:96:1: warning: Using plain integer as NULL pointer
Konrad Rzeszutek Wilk [Mon, 13 Aug 2012 17:26:11 +0000 (13:26 -0400)]
xen/swiotlb: Remove functions not needed anymore.
Sparse warns us off:
drivers/xen/swiotlb-xen.c:506:1: warning: symbol 'xen_swiotlb_map_sg' was not declared. Should it be static?
drivers/xen/swiotlb-xen.c:534:1: warning: symbol 'xen_swiotlb_unmap_sg' was not declared. Should it be static?
and it looks like we do not need this function at all.
Stefano Stabellini [Wed, 22 Aug 2012 16:20:16 +0000 (17:20 +0100)]
xen: allow privcmd for HVM guests
This patch removes the "return -ENOSYS" for auto_translated_physmap
guests from privcmd_mmap, thus it allows ARM guests to issue privcmd
mmap calls. However privcmd mmap calls are still going to fail for HVM
and hybrid guests on x86 because the xen_remap_domain_mfn_range
implementation is currently PV only.
Changes in v2:
- better commit message;
- return -EINVAL from xen_remap_domain_mfn_range if
auto_translated_physmap.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 1a1d43318aeb74d679372c0b65029957be274529)
Daniel De Graaf [Thu, 16 Aug 2012 20:40:26 +0000 (16:40 -0400)]
xen/sysfs: Use XENVER_guest_handle to query UUID
This hypercall has been present since Xen 3.1, and is the preferred
method for a domain to obtain its UUID. Fall back to the xenstore method
if using an older version of Xen (which returns -ENOSYS).
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 5c13f8067745efc15f6ad0158b58d57c44104c25)
Stefano Stabellini [Mon, 6 Aug 2012 14:27:24 +0000 (15:27 +0100)]
xen: update xen_add_to_physmap interface
Update struct xen_add_to_physmap to be in sync with Xen's version of the
structure.
The size field was introduced by:
changeset: 24164:707d27fe03e7
user: Jean Guyader <jean.guyader@eu.citrix.com>
date: Fri Nov 18 13:42:08 2011 +0000
summary: mm: New XENMEM space, XENMAPSPACE_gmfn_range
According to the comment:
"This new field .size is located in the 16 bits padding between .domid
and .space in struct xen_add_to_physmap to stay compatible with older
versions."
Changes in v2:
- remove erroneous comment in the commit message.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b58aaa4b0b3506c094308342d746f600468c63d9)
xen/p2m: Fix one by off error in checking the P2M tree directory.
We would traverse the full P2M top directory (from 0->MAX_DOMAIN_PAGES
inclusive) when trying to figure out whether we can re-use some of the
P2M middle leafs.
Which meant that if the kernel was compiled with MAX_DOMAIN_PAGES=512
we would try to use the 512th entry. Fortunately for us the p2m_top_index
has a check for this:
Konrad Rzeszutek Wilk [Thu, 16 Aug 2012 20:38:55 +0000 (16:38 -0400)]
xen/p2m: When revectoring deal with holes in the P2M array.
When we free the PFNs and then subsequently populate them back
during bootup:
Freeing 20000-20200 pfn range: 512 pages freed
1-1 mapping on 20000->20200
Freeing 40000-40200 pfn range: 512 pages freed
1-1 mapping on 40000->40200
Freeing bad80-badf4 pfn range: 116 pages freed
1-1 mapping on bad80->badf4
Freeing badf6-bae7f pfn range: 137 pages freed
1-1 mapping on badf6->bae7f
Freeing bb000-100000 pfn range: 282624 pages freed
1-1 mapping on bb000->100000
Released 283999 pages of unused memory
Set 283999 page(s) to 1-1 mapping
Populating 1acb8a-1f20e9 pfn range: 283999 pages added
We end up having the P2M array (that is the one that was
grafted on the P2M tree) filled with IDENTITY_FRAME or
INVALID_P2M_ENTRY) entries. The patch titled
"xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID."
recycles said slots and replaces the P2M tree leaf's with
&mfn_list[xx] with p2m_identity or p2m_missing.
And re-uses the P2M array sections for other P2M tree leaf's.
For the above mentioned bootup excerpt, the PFNs at
0x20000->0x20200 are going to be IDENTITY based:
P2M[0][256][0] -> P2M[0][257][0] get turned in IDENTITY_FRAME.
We can re-use that and replace P2M[0][256] to point to p2m_identity.
The "old" page (the grafted P2M array provided by Xen) that was at
P2M[0][256] gets put somewhere else. Specifically at P2M[6][358],
b/c when we populate back:
we fill P2M[6][358][0] (and P2M[6][358], P2M[6][359], ...) with
the new MFNs.
That is all OK, except when we revector we assume that the PFN
count would be the same in the grafted P2M array and in the
newly allocated. Since that is no longer the case, as we have
holes in the P2M that point to p2m_missing or p2m_identity we
have to take that into account.
[v2: Check for overflow]
[v3: Move within the __va check]
[v4: Fix the computation] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 3fc509fc0c590900568ef516a37101d88f3476f5)
Konrad Rzeszutek Wilk [Fri, 17 Aug 2012 13:35:31 +0000 (09:35 -0400)]
xen/mmu: If the revector fails, don't attempt to revector anything else.
If the P2M revectoring would fail, we would try to continue on by
cleaning the PMD for L1 (PTE) page-tables. The xen_cleanhighmap
is greedy and erases the PMD on both boundaries. Since the P2M
array can share the PMD, we would wipe out part of the __ka
that is still used in the P2M tree to point to P2M leafs.
This fixes it by bypassing the revectoring and continuing on.
If the revector fails, a nice WARN is printed so we can still
troubleshoot this.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Thu, 16 Aug 2012 20:38:55 +0000 (16:38 -0400)]
xen/p2m: When revectoring deal with holes in the P2M array.
When we free the PFNs and then subsequently populate them back
during bootup:
Freeing 20000-20200 pfn range: 512 pages freed
1-1 mapping on 20000->20200
Freeing 40000-40200 pfn range: 512 pages freed
1-1 mapping on 40000->40200
Freeing bad80-badf4 pfn range: 116 pages freed
1-1 mapping on bad80->badf4
Freeing badf6-bae7f pfn range: 137 pages freed
1-1 mapping on badf6->bae7f
Freeing bb000-100000 pfn range: 282624 pages freed
1-1 mapping on bb000->100000
Released 283999 pages of unused memory
Set 283999 page(s) to 1-1 mapping
Populating 1acb8a-1f20e9 pfn range: 283999 pages added
We end up having the P2M array (that is the one that was
grafted on the P2M tree) filled with IDENTITY_FRAME or
INVALID_P2M_ENTRY) entries. The patch titled
"xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID."
recycles said slots and replaces the P2M tree leaf's with
&mfn_list[xx] with p2m_identity or p2m_missing.
And re-uses the P2M array sections for other P2M tree leaf's.
For the above mentioned bootup excerpt, the PFNs at
0x20000->0x20200 are going to be IDENTITY based:
P2M[0][256][0] -> P2M[0][257][0] get turned in IDENTITY_FRAME.
We can re-use that and replace P2M[0][256] to point to p2m_identity.
The "old" page (the grafted P2M array provided by Xen) that was at
P2M[0][256] gets put somewhere else. Specifically at P2M[6][358],
b/c when we populate back:
we fill P2M[6][358][0] (and P2M[6][358], P2M[6][359], ...) with
the new MFNs.
That is all OK, except when we revector we assume that the PFN
count would be the same in the grafted P2M array and in the
newly allocated. Since that is no longer the case, as we have
holes in the P2M that point to p2m_missing or p2m_identity we
have to take that into account.
[v2: Check for overflow] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Fri, 17 Aug 2012 13:27:35 +0000 (09:27 -0400)]
xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID.
If P2M leaf is completly packed with INVALID_P2M_ENTRY or with
1:1 PFNs (so IDENTITY_FRAME type PFNs), we can swap the P2M leaf
with either a p2m_missing or p2m_identity respectively. The old
page (which was created via extend_brk or was grafted on from the
mfn_list) can be re-used for setting new PFNs.
This also means we can remove git commit: 5bc6f9888db5739abfa0cae279b4b442e4db8049
xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back
which tried to fix this.
and make the amount that is required to be reserved much smaller.
CC: stable@vger.kernel.org # for 3.5 only. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Tue, 14 Aug 2012 20:37:31 +0000 (16:37 -0400)]
xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
We call memblock_reserve for [start of mfn list] -> [PMD aligned end
of mfn list] instead of <start of mfn list> -> <page aligned end of mfn list].
This has the disastrous effect that if at bootup the end of mfn_list is
not PMD aligned we end up returning to memblock parts of the region
past the mfn_list array. And those parts are the PTE tables with
the disastrous effect of seeing this at bootup:
Write protecting the kernel read-only data: 10240k
Freeing unused kernel memory: 1860k freed
Freeing unused kernel memory: 200k freed
(XEN) mm.c:2429:d0 Bad type (saw 1400000000000002 != exp 7000000000000000) for mfn 116a80 (pfn 14e26)
...
(XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
(XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0
.. and so on.
This is not for upstream as it memblock_x86_reserve_range is not
used upstream anymore.
When I back-ported the patches:
xen/x86: Use memblock_reserve for sensitive areas.
xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
I simply used sed s/memblock_reserve/memblock_x86_reserve_range/.
That was incorrect as the parameters are different - memblock_reserve
as second expects the size, while memblock_x86_reserve_range expects
the physical address. This patch fixes those bugs.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
We then try to populate those pages back. In the P2M tree however
the space for those leafs must be reserved - as such we use extend_brk.
We reserve 8MB of _brk space, which means we can fit over 1048576 PFNs - which is more than we should ever need.
[v1: Made it 8MB of _brk space instead of 4MB per Jan's suggestion] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 99266871de5006ba7ad0bfece6bb283ede4094b9)
xen/mmu: Remove from __ka space PMD entries for pagetables.
Please first read the description in "xen/mmu: Copy and revector the
P2M tree."
At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).
The xen_remove_p2m_tree and code around has ripped out the __ka for
the old P2M array.
Here we continue on doing it to where the Xen page-tables were.
It is safe to do it, as the page-tables are addressed using __va.
For good measure we delete anything that is within MODULES_VADDR
and up to the end of the PMD.
At this point the __ka only contains PMD entries for the start
of the kernel up to __brk.
[v1: Per Stefano's suggestion wrapped the MODULES_VADDR in debug] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4e928e1a48b6b76e0b8384160213a32d03197e4b)
Please first read the description in "xen/p2m: Add logic to revector a
P2M tree to use __va leafs" patch.
The 'xen_revector_p2m_tree()' function allocates a new P2M tree
copies the contents of the old one in it, and returns the new one.
At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).
We have revectored the P2M tree (and the one for save/restore as well)
to use new shiny __va address to new MFNs. The xen_start_info
has been taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so
we are free to roam and delete PMD entries - which is exactly what
we are going to do. We rip out the __ka for the old P2M array.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
As can be seen, the ramdisk, P2M and pagetables are taking
a bit of __ka addresses space. Which is a problem since the
MODULES_VADDR starts at 0xffffffffa0000000 - and P2M sits
right in there! This results during bootup with the inability to
load modules, with this error:
Since the __va and __ka are 1:1 up to MODULES_VADDR and
cleanup_highmap rids __ka of the ramdisk mapping, what
we want to do is similar - get rid of the P2M in the __ka
address space. There are two ways of fixing this:
1) All P2M lookups instead of using the __ka address would
use the __va address. This means we can safely erase from
__ka space the PMD pointers that point to the PFNs for
P2M array and be OK.
2). Allocate a new array, copy the existing P2M into it,
revector the P2M tree to use that, and return the old
P2M to the memory allocate. This has the advantage that
it sets the stage for using XEN_ELF_NOTE_INIT_P2M
feature. That feature allows us to set the exact virtual
address space we want for the P2M - and allows us to
boot as initial domain on large machines.
So we pick option 2).
This patch only lays the groundwork in the P2M code. The patch
that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."
xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
As we are not using them. We end up only using the L1 pagetables
and grafting those to our page-tables.
[v1: Per Stefano's suggestion squashed two commits]
[v2: Per Stefano's suggestion simplified loop] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit cbc09be35990fb3d15671507f11c3e90479ef816)
xen/mmu: Provide comments describing the _ka and _va aliasing issue
Which is that the level2_kernel_pgt (__ka virtual addresses)
and level2_ident_pgt (__va virtual address) contain the same
PMD entries. So if you modify a PTE in __ka, it will be reflected
in __va (and vice-versa).
xen/x86: Use memblock_reserve for sensitive areas.
instead of a big memblock_reserve. This way we can be more
selective in freeing regions (and it also makes it easier
to understand where is what).
[v1: Move the auto_translate_physmap to proper line]
[v2: Per Stefano suggestion add more comments] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[upstream git commit 91addbf07abfdd109a9da4e02061e6ed3728b298]
Conflicts:
The P2M code is smart enough to return false (which means that it
cannot allocate anymore) and the error can perculate up the calling
stack without trouble - with the error logic doing the proper thing.
So check the __brk_limit values before allocating from extend_brk.
This allows us to boot on machines where we do not have enough
__brk space, and we would get this:
Interestingly enough, most of the time we are not going to hit this
b/c the _brk space is quite large (v3.5): ffffffff81a25000 B __brk_base ffffffff81e43000 B __brk_limit
= ~4MB.
vs earlier kernels (with this back-ported), the space is smaller: ffffffff81a25000 B __brk_base ffffffff81a7b000 B __brk_limit
= 344 kBytes.
With this patch, we would get now a limited amount of pages populated back:
Freeing 9f-100 pfn range: 97 pages freed
Freeing b7ee0-ecd9b pfn range: 216763 pages freed
Released 216860 pages of unused memory
Set 295297 page(s) to 1-1 mapping
Populating 100000-134f1c pfn range: 30720 pages added
[while it was instructed to populate 216860 pages back
on this particular machine]
Olaf Hering [Tue, 17 Jul 2012 15:43:35 +0000 (17:43 +0200)]
xen PVonHVM: move shared_info to MMIO before kexec
Currently kexec in a PVonHVM guest fails with a triple fault because the
new kernel overwrites the shared info page. The exact failure depends on
the size of the kernel image. This patch moves the pfn from RAM into
MMIO space before the kexec boot.
The pfn containing the shared_info is located somewhere in RAM. This
will cause trouble if the current kernel is doing a kexec boot into a
new kernel. The new kernel (and its startup code) can not know where the
pfn is, so it can not reserve the page. The hypervisor will continue to
update the pfn, and as a result memory corruption occours in the new
kernel.
One way to work around this issue is to allocate a page in the
xen-platform pci device's BAR memory range. But pci init is done very
late and the shared_info page is already in use very early to read the
pvclock. So moving the pfn from RAM to MMIO is racy because some code
paths on other vcpus could access the pfn during the small window when
the old pfn is moved to the new pfn. There is even a small window were
the old pfn is not backed by a mfn, and during that time all reads
return -1.
Because it is not known upfront where the MMIO region is located it can
not be used right from the start in xen_hvm_init_shared_info.
To minimise trouble the move of the pfn is done shortly before kexec.
This does not eliminate the race because all vcpus are still online when
the syscore_ops will be called. But hopefully there is no work pending
at this point in time. Also the syscore_op is run last which reduces the
risk further.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Olaf Hering [Tue, 17 Jul 2012 09:59:15 +0000 (11:59 +0200)]
xen: simplify init_hvm_pv_info
init_hvm_pv_info is called only in PVonHVM context, move it into ifdef.
init_hvm_pv_info does not fail, make it a void function.
remove arguments from init_hvm_pv_info because they are not used by the
caller.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Olaf Hering [Tue, 10 Jul 2012 13:31:39 +0000 (15:31 +0200)]
xen: enable platform-pci only in a Xen guest
While debugging kexec issues in a PVonHVM guest I modified
xen_hvm_platform() to return false to disable all PV drivers. This
caused a crash in platform_pci_init() because it expects certain data
structures to be initialized properly.
To avoid such a crash make sure the driver is initialized only if
running in a Xen guest.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Olaf Hering [Tue, 10 Jul 2012 12:50:03 +0000 (14:50 +0200)]
xen/pv-on-hvm kexec: shutdown watches from old kernel
Add xs_reset_watches function to shutdown watches from old kernel after
kexec boot. The old kernel does not unregister all watches in the
shutdown path. They are still active, the double registration can not
be detected by the new kernel. When the watches fire, unexpected events
will arrive and the xenwatch thread will crash (jumps to NULL). An
orderly reboot of a hvm guest will destroy the entire guest with all its
resources (including the watches) before it is rebuilt from scratch, so
the missing unregister is not an issue in that case.
With this change the xenstored is instructed to wipe all active watches
for the guest. However, a patch for xenstored is required so that it
accepts the XS_RESET_WATCHES request from a client (see changeset
23839:42a45baf037d in xen-unstable.hg). Without the patch for xenstored
the registration of watches will fail and some features of a PVonHVM
guest are not available. The guest is still able to boot, but repeated
kexec boots will fail.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Vrabel [Mon, 9 Jul 2012 10:39:06 +0000 (11:39 +0100)]
xen/mm: zero PTEs for non-present MFNs in the initial page table
When constructing the initial page tables, if the MFN for a usable PFN
is missing in the p2m then that frame is initially ballooned out. In
this case, zero the PTE (as in decrease_reservation() in
drivers/xen/balloon.c).
This is obviously safe instead of having an valid PTE with an MFN of
INVALID_P2M_ENTRY (~0).
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Vrabel [Mon, 9 Jul 2012 10:39:05 +0000 (11:39 +0100)]
xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable
In xen_set_pte() if batching is unavailable (because the caller is in
an interrupt context such as handling a page fault) it would fall back
to using native_set_pte() and trapping and emulating the PTE write.
On 32-bit guests this requires two traps for each PTE write (one for
each dword of the PTE). Instead, do one mmu_update hypercall
directly.
During construction of the initial page tables, continue to use
native_set_pte() because most of the PTEs being set are in writable
and unpinned pages (see phys_pmd_init() in arch/x86/mm/init_64.c) and
using a hypercall for this is very expensive.
This significantly improves page fault performance in 32-bit PV
guests.
lmbench3 test Before After Improvement
----------------------------------------------
lat_pagefault 3.18 us 2.32 us 27%
lat_proc fork 356 us 313.3 us 11%
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>