Bastian Blank [Sat, 10 Dec 2011 18:29:48 +0000 (19:29 +0100)]
xen: Add xenbus_backend device
Access for xenstored to the event channel and pre-allocated ring is
managed via xenfs. This adds its own character device featuring mmap
for the ring and an ioctl for the event channel.
Signed-off-by: Bastian Blank <waldi@debian.org> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Bastian Blank [Fri, 16 Dec 2011 16:34:33 +0000 (11:34 -0500)]
xen: Add privcmd device driver
Access to arbitrary hypercalls is currently provided via xenfs. This
adds a standard character device to handle this. The support in xenfs
remains for backward compatibility and uses the device driver code.
Signed-off-by: Bastian Blank <waldi@debian.org> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
Daniel De Graaf [Mon, 28 Nov 2011 16:49:11 +0000 (11:49 -0500)]
xen/gntalloc: fix reference counts on multi-page mappings
When a multi-page mapping of gntalloc is created, the reference counts
of all pages in the vma are incremented. However, the vma open/close
operations only adjusted the reference count of the first page in the
mapping, leaking the other pages. Store a struct in the vm_private_data
to track the original page count to properly free the pages when the
last reference to the vma is closed.
Reported-by: Anil Madhavapeddy <anil@recoil.org> Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Mon, 28 Nov 2011 16:49:10 +0000 (11:49 -0500)]
xen/gntalloc: release grant references on page free
gnttab_end_foreign_access_ref does not return the grant reference it is
passed to the free list; gnttab_free_grant_reference needs to be
explicitly called. While gnttab_end_foreign_access provides a wrapper
for this, it is unsuitable because it does not return errors.
Reported-by: Anil Madhavapeddy <anil@recoil.org> Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Mon, 28 Nov 2011 16:49:09 +0000 (11:49 -0500)]
xen/events: prevent calling evtchn_get on invalid channels
The event channel number provided to evtchn_get can be provided by
userspace, so needs to be checked against the maximum number of event
channels prior to using it to index into evtchn_to_irq.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Mon, 12 Dec 2011 10:15:07 +0000 (18:15 +0800)]
xen/granttable: Support transitive grants
These allow a domain A which has been granted access on a page of domain B's
memory to issue domain C with a copy-grant on the same page. This is useful
e.g. for forwarding packets between domains.
Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Mon, 12 Dec 2011 10:14:42 +0000 (18:14 +0800)]
xen/granttable: Support sub-page grants
- They can't be used to map the page (so can only be used in a GNTTABOP_copy
hypercall).
- It's possible to grant access with a finer granularity than whole pages.
- Xen guarantees that they can be revoked quickly (a normal map grant can
only be revoked with the cooperation of the domain which has been granted
access).
Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Luck, Tony [Wed, 30 Nov 2011 18:22:37 +0000 (10:22 -0800)]
xen/ia64: fix build breakage because of conflicting u64 guest handles
include/xen/interface/xen.h:526: error: conflicting types for ‘__guest_handle_u64’
arch/ia64/include/asm/xen/interface.h:74: error: previous declaration of ‘__guest_handle_u64’ was here
Problem introduced by "xen/granttable: Introducing grant table V2 stucture"
which added a new definition to include/xen/interface/xen.h for "u64".
Fix: delete the ia64 arch specific definition.
Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Tue, 22 Nov 2011 01:59:56 +0000 (09:59 +0800)]
xen/granttable: Keep code format clean
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Tue, 22 Nov 2011 01:59:21 +0000 (09:59 +0800)]
xen/granttable: Grant tables V2 implementation
Receiver-side copying of packets is based on this implementation, it gives
better performance and better CPU accounting. It totally supports three types:
full-page, sub-page and transitive grants.
However this patch does not cover sub-page and transitive grants, it mainly
focus on Full-page part and implements grant table V2 interfaces corresponding
to what already exists in grant table V1, such as: grant table V2
initialization, mapping, releasing and exported interfaces.
Each guest can only supports one type of grant table type, every entry in grant
table should be the same version. It is necessary to set V1 or V2 version before
initializing the grant table.
Grant table exported interfaces of V2 are same with those of V1, Xen is
responsible to judge what grant table version guests are using in every grant
operation.
V2 fulfills the same role of V1, and it is totally backwards compitable with V1.
If dom0 support grant table V2, the guests runing on it can run with either V1
or V2.
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Annie Li <annie.li@oracle.com>
[v1: Modified alloc_vm_area call (new parameters), indentation, and cleanpatch
warnings] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Tue, 22 Nov 2011 01:58:47 +0000 (09:58 +0800)]
xen/granttable: Refactor some code
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Tue, 22 Nov 2011 01:58:06 +0000 (09:58 +0800)]
xen/granttable: Introducing grant table V2 stucture
This patch introduces new structures of grant table V2, grant table V2 is an
extension from V1. Grant table is shared between guest and Xen, and Xen is
responsible to do corresponding work for grant operations, such as: figure
out guest's grant table version, perform different actions based on
different grant table version, etc. Although full-page structure of V2
is different from V1, it play the same role as V1.
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Jeremy Fitzhardinge [Fri, 18 Nov 2011 23:56:06 +0000 (15:56 -0800)]
Xen: update MAINTAINER info
No longer at Citrix, still interested in Xen.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Thu, 27 Oct 2011 21:58:47 +0000 (17:58 -0400)]
xen/event: Add reference counting to event channels
Event channels exposed to userspace by the evtchn module may be used by
other modules in an asynchronous manner, which requires that reference
counting be used to prevent the event channel from being closed before
the signals are delivered.
The reference count on new event channels defaults to -1 which indicates
the event channel is not referenced outside the kernel; evtchn_get fails
if called on such an event channel. The event channels made visible to
userspace by evtchn have a normal reference count.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Thu, 2 Feb 2012 18:58:45 +0000 (13:58 -0500)]
Merge branch 'in-3.1/bug.fixes' into stable/for-linus-3.3.rebased
* in-3.1/bug.fixes:
x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode.
xen/i386: follow-up to "replace order-based range checking of M2P table by linear one"
xen/irq: Alter the locking to use a mutex instead of a spinlock.
xen/e820: if there is no dom0_mem=, don't tweak extra_pages.
Revert "xen/e820: if there is no dom0_mem=, don't tweak extra_pages."
xen/e820: if there is no dom0_mem=, don't tweak extra_pages.
xen: disable PV spinlocks on HVM
xen/smp: Warn user why they keel over - nosmp or noapic and what to use instead.
xen: x86_32: do not enable iterrupts when returning from exception in interrupt context
xen: use maximum reservation to limit amount of usable RAM
xen: Do not enable PV IPIs when vector callback not present
xen/x86: replace order-based range checking of M2P table by linear one
xen: Fix misleading WARN message at xen_release_chunk
xen: Fix printk() format in xen/setup.c
xen/grant: Fix compile warning.
xen:pvhvm: Modpost section mismatch fix
Daniel De Graaf [Thu, 27 Oct 2011 21:58:49 +0000 (17:58 -0400)]
xen/gnt{dev,alloc}: reserve event channels for notify
When using the unmap notify ioctl, the event channel used for
notification needs to be reserved to avoid it being deallocated prior to
sending the notification.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
Daniel De Graaf [Thu, 27 Oct 2011 21:58:48 +0000 (17:58 -0400)]
xen/gntalloc: Change gref_lock to a mutex
The event channel release function cannot be called under a spinlock
because it can attempt to acquire a mutex due to the event channel
reference acquired when setting up unmap notifications.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Dan Carpenter [Fri, 4 Nov 2011 18:24:36 +0000 (21:24 +0300)]
xen-gntalloc: signedness bug in add_grefs()
gref->gref_id is unsigned so the error handling didn't work.
gnttab_grant_foreign_access() returns an int type, so we can add a
cast here, and it doesn't cause any problems.
gnttab_grant_foreign_access() can return a variety of errors
including -ENOSPC, -ENOSYS and -ENOMEM.
CC: stable@kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Dan Carpenter [Fri, 4 Nov 2011 18:24:08 +0000 (21:24 +0300)]
xen-gntalloc: integer overflow in gntalloc_ioctl_alloc()
On 32 bit systems a high value of op.count could lead to an integer
overflow in the kzalloc() and gref_ids would be smaller than
expected. If the you triggered another integer overflow in
"if (gref_size + op.count > limit)" then you'd probably get memory
corruption inside add_grefs().
CC: stable@kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Wed, 19 Oct 2011 21:59:37 +0000 (17:59 -0400)]
xen/balloon: Avoid OOM when requesting highmem
If highmem pages are requested from the balloon on a system without
highmem, the implementation of alloc_xenballooned_pages will allocate
all available memory trying to find highmem pages to return. Allow
low memory to be returned when highmem pages are requested to avoid
this loop.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Tue, 11 Oct 2011 19:16:06 +0000 (15:16 -0400)]
xen/gntdev: Fix sleep-inside-spinlock
BUG: sleeping function called from invalid context at /local/scratch/dariof/linux/kernel/mutex.c:271
in_atomic(): 1, irqs_disabled(): 0, pid: 3256, name: qemu-dm
1 lock held by qemu-dm/3256:
#0: (&(&priv->lock)->rlock){......}, at: [<ffffffff813223da>] gntdev_ioctl+0x2bd/0x4d5
Pid: 3256, comm: qemu-dm Tainted: G W 3.1.0-rc8+ #5
Call Trace:
[<ffffffff81054594>] __might_sleep+0x131/0x135
[<ffffffff816bd64f>] mutex_lock_nested+0x25/0x45
[<ffffffff8131c7c8>] free_xenballooned_pages+0x20/0xb1
[<ffffffff8132194d>] gntdev_put_map+0xa8/0xdb
[<ffffffff816be546>] ? _raw_spin_lock+0x71/0x7a
[<ffffffff813223da>] ? gntdev_ioctl+0x2bd/0x4d5
[<ffffffff8132243c>] gntdev_ioctl+0x31f/0x4d5
[<ffffffff81007d62>] ? check_events+0x12/0x20
[<ffffffff811433bc>] do_vfs_ioctl+0x488/0x4d7
[<ffffffff81007d4f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[<ffffffff8109168b>] ? lock_release+0x21c/0x229
[<ffffffff81135cdd>] ? rcu_read_unlock+0x21/0x32
[<ffffffff81143452>] sys_ioctl+0x47/0x6a
[<ffffffff816bfd82>] system_call_fastpath+0x16/0x1b
gntdev_put_map tries to acquire a mutex when freeing pages back to the
xenballoon pool, so it cannot be called with a spinlock held. In
gntdev_release, the spinlock is not needed as we are freeing the
structure later; in the ioctl, only the list manipulation needs to be
under the lock.
Reported-and-Tested-By: Dario Faggioli <dario.faggioli@citrix.com> Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen: modify kernel mappings corresponding to granted pages
If we want to use granted pages for AIO, changing the mappings of a user
vma and the corresponding p2m is not enough, we also need to update the
kernel mappings accordingly.
Currently this is only needed for pages that are created for user usages
through /dev/xen/gntdev. As in, pages that have been in use by the
kernel and use the P2M will not need this special mapping.
However there are no guarantees that in the future the kernel won't
start accessing pages through the 1:1 even for internal usage.
In order to avoid the complexity of dealing with highmem, we allocated
the pages lowmem.
We issue a HYPERVISOR_grant_table_op right away in
m2p_add_override and we remove the mappings using another
HYPERVISOR_grant_table_op in m2p_remove_override.
Considering that m2p_add_override and m2p_remove_override are called
once per page we use multicalls and hypercall batching.
Use the kmap_op pointer directly as argument to do the mapping as it is
guaranteed to be present up until the unmapping is done.
Before issuing any unmapping multicalls, we need to make sure that the
mapping has already being done, because we need the kmap->handle to be
set correctly.
We dropped a lot of the MMU debugfs in favour of using
tracing API - but there is one which just provides
mostly static information that was made invisible by this change.
Bring it back.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The problem is that in copy_page_range we turn lazy mode on, and then
in swap_entry_free we call swap_count_continued which ends up in:
map = kmap_atomic(page, KM_USER0) + offset;
and then later we touch *map.
Since we are running in batched mode (lazy) we don't actually set up the
PTE mappings and the kmap_atomic is not done synchronously and ends up
trying to dereference a page that has not been set.
Looking at kmap_atomic_prot_pfn, it uses 'arch_flush_lazy_mmu_mode' and
doing the same in kmap_atomic_prot and __kunmap_atomic makes the problem
go away.
Interestingly, git commit b8bcfe997e46150fedcc3f5b26b846400122fdd9
removed part of this to fix an interrupt issue - but it went to far
and did not consider this scenario.
CC: Thomas Gleixner <tglx@linutronix.de> CC: Ingo Molnar <mingo@redhat.com> CC: "H. Peter Anvin" <hpa@zytor.com> CC: x86@kernel.org CC: Peter Zijlstra <a.p.zijlstra@chello.nl> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: stable@kernel.org
[v1: Redid the commit description per Jeremy's apt suggestion] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Jan Beulich [Thu, 15 Sep 2011 07:52:40 +0000 (08:52 +0100)]
xen/i386: follow-up to "replace order-based range checking of M2P table by linear one"
The numbers obtained from the hypervisor really can't ever lead to an
overflow here, only the original calculation going through the order
of the range could have. This avoids the (as Jeremy points outs)
somewhat ugly NULL-based calculation here.
Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/irq: Alter the locking to use a mutex instead of a spinlock.
When we allocate/change the IRQ informations, we do not
need to use spinlocks. We can use a mutex (which is
what the generic IRQ code does for allocations/changes).
Fixes a slew of:
Reported-by: Jim Burns <jim_burn@bellsouth.net> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Vrabel [Tue, 13 Sep 2011 14:17:32 +0000 (10:17 -0400)]
xen/e820: if there is no dom0_mem=, don't tweak extra_pages.
The patch "xen: use maximum reservation to limit amount of usable RAM"
(d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11) breaks machines that
do not use 'dom0_mem=' argument with:
reserve RAM buffer: 000000133f2e2000 - 000000133fffffff
(XEN) mm.c:4976:d0 Global bit is set to kernel page fffff8117e
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
...
The reason being that the last E820 entry is created using the
'extra_pages' (which is based on how many pages have been freed).
The mentioned git commit sets the initial value of 'extra_pages'
using a hypercall which returns the number of pages (if dom0_mem
has been used) or -1 otherwise. If the later we return with
MAX_DOMAIN_PAGES as basis for calculation:
which means we end up with extra_pages = 128GB in PFNs (33554432)
- 8GB in PFNs (2097152, on this specific box, can be larger or smaller),
and then we add that value to the E820 making it:
xen/e820: if there is no dom0_mem=, don't tweak extra_pages.
The patch "xen: use maximum reservation to limit amount of usable RAM"
(d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11) breaks machines that
do not use 'dom0_mem=' argument with:
reserve RAM buffer: 000000133f2e2000 - 000000133fffffff
(XEN) mm.c:4976:d0 Global bit is set to kernel page fffff8117e
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
...
The reason being that the last E820 entry is created using the
'extra_pages' (which is based on how many pages have been freed).
The mentioned git commit sets the initial value of 'extra_pages'
using a hypercall which returns the number of pages (if dom0_mem
has been used) or -1 otherwise. If the later we return with
MAX_DOMAIN_PAGES as basis for calculation:
which means we end up with extra_pages = 128GB in PFNs (33554432)
- 8GB in PFNs (2097152, on this specific box, can be larger or smaller),
and then we add that value to the E820 making it:
PV spinlocks cannot possibly work with the current code because they are
enabled after pvops patching has already been done, and because PV
spinlocks use a different data structure than native spinlocks so we
cannot switch between them dynamically. A spinlock that has been taken
once by the native code (__ticket_spin_lock) cannot be taken by
__xen_spin_lock even after it has been released.
Reported-and-Tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/smp: Warn user why they keel over - nosmp or noapic and what to use instead.
We have hit a couple of customer bugs where they would like to
use those parameters to run an UP kernel - but both of those
options turn of important sources of interrupt information so
we end up not being able to boot. The correct way is to
pass in 'dom0_max_vcpus=1' on the Xen hypervisor line and
the kernel will patch itself to be a UP kernel.
Igor Mammedov [Thu, 1 Sep 2011 11:46:55 +0000 (13:46 +0200)]
xen: x86_32: do not enable iterrupts when returning from exception in interrupt context
If vmalloc page_fault happens inside of interrupt handler with interrupts
disabled then on exit path from exception handler when there is no pending
interrupts, the following code (arch/x86/xen/xen-asm_32.S:112):
Solution is in setting XEN_vcpu_info_mask only when it should be set
according to
cmpw $0x0001, XEN_vcpu_info_pending(%eax)
but not clearing it if there isn't any pending events.
Reproducer for bug is attached to RHBZ 707552
CC: stable@kernel.org Signed-off-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Vrabel [Fri, 19 Aug 2011 14:57:16 +0000 (15:57 +0100)]
xen: use maximum reservation to limit amount of usable RAM
Use the domain's maximum reservation to limit the amount of extra RAM
for the memory balloon. This reduces the size of the pages tables and
the amount of reserved low memory (which defaults to about 1/32 of the
total RAM).
On a system with 8 GiB of RAM with the domain limited to 1 GiB the
kernel reports:
Before:
Memory: 627792k/4472000k available
After:
Memory: 549740k/11132224k available
A increase of about 76 MiB (~1.5% of the unused 7 GiB). The reserved
low memory is also reduced from 253 MiB to 32 MiB. The total
additional usable RAM is 329 MiB.
For dom0, this requires at patch to Xen ('x86: use 'dom0_mem' to limit
the number of pages for dom0') (c/s 23790)
CC: stable@kernel.org Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This change replaced the SMP operations with event based handlers without
taking into account that this only works when the hypervisor supports
callback vectors. This causes unexplainable hangs early on boot for
HVM guests with more than one CPU.
BugLink: http://bugs.launchpad.net/bugs/791850 CC: stable@kernel.org Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Tested-and-Reported-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
whereas a direct check requires just a compare, like in
cmp ebx, [machine_to_phys_nr]
jae ...
), but also slightly dangerous in the 32-on-64 case - the element
address calculation can wrap if the next power of two boundary is
sufficiently far away from the actual upper limit of the table, and
hence can result in user space addresses being accessed (with it being
unknown what may actually be mapped there).
Additionally, the elimination of the mistaken use of fls() here (should
have been __fls()) fixes a latent issue on x86-64 that would trigger
if the code was run on a system with memory extending beyond the 44-bit
boundary.
CC: stable@kernel.org Signed-off-by: Jan Beulich <jbeulich@novell.com>
[v1: Based on Jeremy's feedback] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Igor Mammedov [Tue, 2 Aug 2011 09:45:25 +0000 (11:45 +0200)]
xen: Fix misleading WARN message at xen_release_chunk
WARN message should not complain
"Failed to release memory %lx-%lx err=%d\n"
^^^^^^^
about range when it fails to release just one page,
instead it should say what pfn is not freed.
In addition line:
printk(KERN_INFO "xen_release_chunk: looking at area pfn %lx-%lx: "
...
printk(KERN_CONT "%lu pages freed\n", len);
will be broken if WARN in between this line is fired. So fix it
by using a single printk for this.
Signed-off-by: Igor Mammedov <imammedo@redhat.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Thu, 4 Aug 2011 22:42:10 +0000 (18:42 -0400)]
xen/trace: Fix compile error when CONFIG_XEN_PRIVILEGED_GUEST is not set
with CONFIG_XEN and CONFIG_FTRACE set we get this:
arch/x86/xen/trace.c:22: error: ‘__HYPERVISOR_console_io’ undeclared here (not in a function)
arch/x86/xen/trace.c:22: error: array index in initializer not of integer type
arch/x86/xen/trace.c:22: error: (near initialization for ‘xen_hypercall_names’)
arch/x86/xen/trace.c:23: error: ‘__HYPERVISOR_physdev_op_compat’ undeclared here (not in a function)
Issue was that the definitions of __HYPERVISOR were not pulled
if CONFIG_XEN_PRIVILEGED_GUEST was not set.
Reported-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Jeremy Fitzhardinge [Mon, 25 Jul 2011 22:51:02 +0000 (15:51 -0700)]
xen/tracing: fix compile errors when tracing is disabled.
When CONFIG_FUNCTION_TRACER is disabled, compilation fails as follows:
CC arch/x86/xen/setup.o
In file included from arch/x86/include/asm/xen/hypercall.h:42,
from arch/x86/xen/setup.c:19:
include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
include/trace/events/xen.h:31: warning: its scope is only this definition or declaration, which is probably not what you want
include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
include/trace/events/xen.h:31: warning: 'struct multicall_entry' declared inside parameter list
[...]
arch/x86/xen/trace.c:5: error: '__HYPERVISOR_set_trap_table' undeclared here (not in a function)
arch/x86/xen/trace.c:5: error: array index in initializer not of integer type
arch/x86/xen/trace.c:5: error: (near initialization for 'xen_hypercall_names')
arch/x86/xen/trace.c:6: error: '__HYPERVISOR_mmu_update' undeclared here (not in a function)
arch/x86/xen/trace.c:6: error: array index in initializer not of integer type
arch/x86/xen/trace.c:6: error: (near initialization for 'xen_hypercall_names')
Fix this by making sure struct multicall_entry has a declaration in
scope at all times, and don't bother compiling xen/trace.c when tracing
is disabled.
Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Jeremy Fitzhardinge [Fri, 17 Dec 2010 22:58:43 +0000 (14:58 -0800)]
xen/mmu: tune pgtable alloc/release
Make sure the fastpath code is inlined. Batch the page permission change
and the pin/unpin, and make sure that it can be batched with any
adjacent set_pte/pmd/etc operations.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Removing __init from check_platform_magic since it is called by
xen_unplug_emulated_devices in non-init contexts (It probably gets inlined
because of -finline-functions-called-once, removing __init is more to avoid
mismatch being reported).
Signed-off-by: Raghavendra D Prabhu <rprabhu@wnohang.net> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Make Dell Latitude E6420 use reboot=pci
x86: Make Dell Latitude E5420 use reboot=pci
H. Peter Anvin [Thu, 21 Jul 2011 18:22:21 +0000 (11:22 -0700)]
x86: Make Dell Latitude E6420 use reboot=pci
Yet another variant of the Dell Latitude series which requires
reboot=pci.
From the E5420 bug report by Daniel J Blueman:
> The E6420 is affected also (same platform, different casing and
> features), which provides an external confirmation of the issue; I can
> submit a patch for that later or include it if you prefer:
> http://linux.koolsolutions.com/2009/08/04/howto-fix-linux-hangfreeze-during-reboots-and-restarts/
Reported-by: Daniel J Blueman <daniel.blueman@gmail.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@kernel.org>
Daniel J Blueman [Fri, 13 May 2011 01:04:59 +0000 (09:04 +0800)]
x86: Make Dell Latitude E5420 use reboot=pci
Rebooting on the Dell E5420 often hangs with the keyboard or ACPI
methods, but is reliable via the PCI method.
[ hpa: this was deferred because we believed for a long time that the
recent reshuffling of the boot priorities in commit 660e34cebf0a11d54f2d5dd8838607452355f321 fixed this platform.
Unfortunately that turned out to be incorrect. ]
vfs: drop conditional inode prefetch in __do_lookup_rcu
It seems to hurt performance in real life. Yes, the inode will be used
later, but the conditional doesn't seem to predict all that well
(negative dentries are not uncommon) and it looks like the cost of
prefetching is simply higher than depending on the cache doing the right
thing.
The compiler, at least for ix86 and m68k, validly warns that the
comparison:
next <= (loff_t)-1
is always true (and it's always true also for x86-64 and probably all
other arches - as long as pgoff_t isn't wider than loff_t). The
intention appears to be to avoid wrapping of "next", so rather than
eliminating the pointless comparison, fix the loop to indeed get exited
when "next" would otherwise wrap.
On m68k the following warning is observed:
fs/fscache/page.c: In function '__fscache_uncache_all_inode_pages':
fs/fscache/page.c:979: warning: comparison is always false due to limited range of data type
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Reported-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Suresh Jayaraman <sjayaraman@suse.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Avoid creating superfluous NUMA domains on non-NUMA systems
sched: Allow for overlapping sched_domain spans
sched: Break out cpu_power from the sched_group structure
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86. reboot: Make Dell Latitude E6320 use reboot=pci
x86, doc only: Correct real-mode kernel header offset for init_size
x86: Disable AMD_NUMA for 32bit for now
Paul E. McKenney [Tue, 19 Jul 2011 10:25:36 +0000 (03:25 -0700)]
signal: align __lock_task_sighand() irq disabling and RCU
The __lock_task_sighand() function calls rcu_read_lock() with interrupts
and preemption enabled, but later calls rcu_read_unlock() with interrupts
disabled. It is therefore possible that this RCU read-side critical
section will be preempted and later RCU priority boosted, which means that
rcu_read_unlock() will call rt_mutex_unlock() in order to deboost itself, but
with interrupts disabled. This results in lockdep splats, so this commit
nests the RCU read-side critical section within the interrupt-disabled
region of code. This prevents the RCU read-side critical section from
being preempted, and thus prevents the attempt to deboost with interrupts
disabled.
It is quite possible that a better long-term fix is to make rt_mutex_unlock()
disable irqs when acquiring the rt_mutex structure's ->wait_lock.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Peter Zijlstra [Tue, 19 Jul 2011 22:32:00 +0000 (15:32 -0700)]
softirq,rcu: Inform RCU of irq_exit() activity
The rcu_read_unlock_special() function relies on in_irq() to exclude
scheduler activity from interrupt level. This fails because exit_irq()
can invoke the scheduler after clearing the preempt_count() bits that
in_irq() uses to determine that it is at interrupt level. This situation
can result in failures as follows:
$task IRQ SoftIRQ
rcu_read_lock()
/* do stuff */
<preempt> |= UNLOCK_BLOCKED
rcu_read_unlock()
--t->rcu_read_lock_nesting
irq_enter();
/* do stuff, don't use RCU */
irq_exit();
sub_preempt_count(IRQ_EXIT_OFFSET);
invoke_softirq()
Ed can simply trigger this 'easy' because invoke_softirq() immediately
does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff
first, but even without that the above happens.
Cure this by also excluding softirqs from the
rcu_read_unlock_special() handler and ensuring the force_irqthreads
ksoftirqd/# wakeup is done from full softirq context.
[ Alternatively, delaying the ->rcu_read_lock_nesting decrement
until after the special handling would make the thing more robust
in the face of interrupts as well. And there is a separate patch
for that. ]
Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-tested-by: Ed Tomlinson <edt@aei.ca> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Peter Zijlstra [Tue, 19 Jul 2011 22:07:25 +0000 (15:07 -0700)]
sched: Add irq_{enter,exit}() to scheduler_ipi()
Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual
work. Traditionally we never did any actual work from the resched IPI
and all magic happened in the return from interrupt path.
Now that we do do some work, we need to ensure irq_{enter,exit} are
called so that we don't confuse things.
This affects things like timekeeping, NO_HZ and RCU, basically
everything with a hook in irq_enter/exit.
Explicit examples of things going wrong are:
sched_clock_cpu() -- has a callback when leaving NO_HZ state to take
a new reading from GTOD and TSC. Without this
callback, time is stuck in the past.
RCU -- needs in_irq() to work in order to avoid some nasty deadlocks
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Paul E. McKenney [Mon, 18 Jul 2011 04:14:35 +0000 (21:14 -0700)]
rcu: protect __rcu_read_unlock() against scheduler-using irq handlers
The addition of RCU read-side critical sections within runqueue and
priority-inheritance lock critical sections introduced some deadlock
cycles, for example, involving interrupts from __rcu_read_unlock()
where the interrupt handlers call wake_up(). This situation can cause
the instance of __rcu_read_unlock() invoked from interrupt to do some
of the processing that would otherwise have been carried out by the
task-level instance of __rcu_read_unlock(). When the interrupt-level
instance of __rcu_read_unlock() is called with a scheduler lock held
from interrupt-entry/exit situations where in_irq() returns false,
deadlock can result.
This commit resolves these deadlocks by using negative values of
the per-task ->rcu_read_lock_nesting counter to indicate that an
instance of __rcu_read_unlock() is in flight, which in turn prevents
instances from interrupt handlers from doing any special processing.
This patch is inspired by Steven Rostedt's earlier patch that similarly
made __rcu_read_unlock() guard against interrupt-mediated recursion
(see https://lkml.org/lkml/2011/7/15/326), but this commit refines
Steven's approach to avoid the need for preemption disabling on the
__rcu_read_unlock() fastpath and to also avoid the need for manipulating
a separate per-CPU variable.
This patch avoids need for preempt_disable() by instead using negative
values of the per-task ->rcu_read_lock_nesting counter. Note that nested
rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will
never see ->rcu_read_lock_nesting go to zero, and will therefore never
invoke rcu_read_unlock_special(), thus preventing them from seeing the
RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special.
This patch also adds a check for ->rcu_read_unlock_special being negative
in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS
bit from being set should a scheduling-clock interrupt occur while
__rcu_read_unlock() is exiting from an outermost RCU read-side critical
section.
Of course, __rcu_read_unlock() can be preempted during the time that
->rcu_read_lock_nesting is negative. This could result in the setting
of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it,
and would also result it this task being queued on the corresponding
rcu_node structure's blkd_tasks list. Therefore, some later RCU read-side
critical section would enter rcu_read_unlock_special() to clean up --
which could result in deadlock if that critical section happened to be in
the scheduler where the runqueue or priority-inheritance locks were held.
This situation is dealt with by making rcu_preempt_note_context_switch()
check for negative ->rcu_read_lock_nesting, thus refraining from
queuing the task (and from setting RCU_READ_UNLOCK_BLOCKED) if we are
already exiting from the outermost RCU read-side critical section (in
other words, we really are no longer actually in that RCU read-side
critical section). In addition, rcu_preempt_note_context_switch()
invokes rcu_read_unlock_special() to carry out the cleanup in this case,
which clears out the ->rcu_read_unlock_special bits and dequeues the task
(if necessary), in turn avoiding needless delay of the current RCU grace
period and needless RCU priority boosting.
It is still illegal to call rcu_read_unlock() while holding a scheduler
lock if the prior RCU read-side critical section has ever had either
preemption or irqs enabled. However, the common use case is legal,
namely where then entire RCU read-side critical section executes with
irqs disabled, for example, when the scheduler lock is held across the
entire lifetime of the RCU read-side critical section.
Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Peter Zijlstra [Wed, 20 Jul 2011 16:42:57 +0000 (18:42 +0200)]
sched: Avoid creating superfluous NUMA domains on non-NUMA systems
When creating sched_domains, stop when we've covered the entire
target span instead of continuing to create domains, only to
later find they're redundant and throw them away again.
This avoids single node systems from touching funny NUMA
sched_domain creation code and reduces the risks of the new
SD_OVERLAP code.
Peter Zijlstra [Fri, 15 Jul 2011 08:35:52 +0000 (10:35 +0200)]
sched: Allow for overlapping sched_domain spans
Allow for sched_domain spans that overlap by giving such domains their
own sched_group list instead of sharing the sched_groups amongst
each-other.
This is needed for machines with more than 16 nodes, because
sched_domain_node_span() will generate a node mask from the
16 nearest nodes without regard if these masks have any overlap.
Currently sched_domains have a sched_group that maps to their child
sched_domain span, and since there is no overlap we share the
sched_group between the sched_domains of the various CPUs. If however
there is overlap, we would need to link the sched_group list in
different ways for each cpu, and hence sharing isn't possible.
In order to solve this, allocate private sched_groups for each CPU's
sched_domain but have the sched_groups share a sched_group_power
structure such that we can uniquely track the power.
Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>