www.infradead.org Git - users/jedix/linux-maple.git/log

Merge branch 'stable/misc' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into uek2-merge

Which adds the microcode code support. It is not upstream
and probably won't be as the upstream as the x86 maintainers want to
load the microcode blob (in a new format) as part of the GRUB loader:
[http://lists.xen.org/archives/html/xen-devel/2011-12/msg00250.html]

Jan Beulich implemented a patchset for Xen hypervisor which would do this
as part of the mboot loader and define which payload using 'ucode=<number>'.
[http://lists.xen.org/archives/html/xen-devel/2011-12/msg00007.html]
but that is not what the x86 maintainers want to do (as he did not define
a new format and just ingested the raw binary blob). There is also
a feature: "[PATCH] x86/microcode: Allow "ucode=" argument to be negative"
which will pick the microcode as the last payload.

For the time being lets use this old driver that loads the microcode
in the dom0 and pushes it up to the hypervisor - and let the x86 and xen
folks sort this out.

* 'stable/misc' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
  x86/microcode: check proper return code.
  xen/v86d: Fix /dev/mem to access memory below 1MB
  xen: add CPU microcode update driver
  xen: add dom0_op hypercall
  xen/acpi: Domain0 acpi parser related platform hypercall

Conflicts:
arch/x86/xen/Kconfig

Merge branch 'stable/bug.fixes-3.3.rebased' into uek2-merge

* stable/bug.fixes-3.3.rebased:
  x86/paravirt: Use pte_val instead of pte_flags on CPA pageattr_test
  x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.
  xen/pm_idle: Make pm_idle be default_idle under Xen.

Merge branches 'stable/xen-block.rebase' and 'stable/vmalloc-3.2.rebased' into uek2-merge

* stable/xen-block.rebase:
  xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
  block: xen-blkback: use API provided by xenbus module to map rings
  xen-blkback: convert hole punching to discard request on loop devices
  xen/blkback: Move processing of BLKIF_OP_DISCARD from dispatch_rw_block_io
  xen/blk[front|back]: Enhance discard support with secure erasing support.
  xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together

* stable/vmalloc-3.2.rebased:
  xen: map foreign pages for shared rings by updating the PTEs directly
  net: xen-netback: use API provided by xenbus module to map rings
  block: xen-blkback: use API provided by xenbus module to map rings
  xen: use generic functions instead of xen_{alloc, free}_vm_area()

xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.

When do block-attach/block-detach test with below steps, umount hangs
in the guest. Furthermore shutdown ends up being stuck when umounting file-systems.

1. start guest.
2. attach new block device by xm block-attach in Dom0.
3. mount new disk in guest.
4. execute xm block-detach to detach the block device in dom0 until timeout
5. Any request to the disk will hung.

Root cause:
This issue is caused when setting backend device's state to
'XenbusStateClosing', which sends to the frontend the XenbusStateClosing
notification. When frontend receives the notification it tries to release
the disk in blkfront_closing(), but at that moment the disk is still in use
by guest, so frontend refuses to close. Specifically it sets the disk state to
XenbusStateClosing and sends the notification to backend - when backend receives the
event, it disconnects the vbd from real device, and sets the vbd device state to
XenbusStateClosing. The backend disconnects the real device/file, and any IO
requests to the disk in guest will end up in ether, leaving disk DEAD and set to
XenbusStateClosing. When the guest wants to disconnect the disk, umount will
hang on blkif_release()->xlvbd_release_gendisk() as it is unable to send any IO
to the disk, which prevents clean system shutdown.

Solution:
Don't disconnect backend until frontend state switched to XenbusStateClosed.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Daniel Stodden <daniel.stodden@citrix.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Annie Li <annie.li@oracle.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
[v1: Modified description a bit]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/paravirt: Use pte_val instead of pte_flags on CPA pageattr_test

For details refer to patch "x86/paravirt: Use pte_attrs instead of
pte_flags on CPA/set_p.._wb/wc operations." which explains that
some pages have the _PAGE_PWT bit set in the _PAGE_PSE field
when running under Xen.

When pageattr_test is running it uses pte_flags to check whether
it succedded in setting _PAGE_UNUSED1 bit, but also whether the
page had _PAGE_PSE. This can happen when one of the randomly selected
pages to be tested is a page that has been set to be _PAGE_WC
as under Xen, that field is under _PAGE_PSE. Since the 'pte_huge'
call is using the pte_flags(x) macro, which extracts the "raw" contents
of the PTE, the translation of _PAGE_PSE -> _PAGE_PWT does not happen
and we incorrectly identify the PTE as bad.

Using the 'pte_val' instead of 'pte_flags' fixes the problem and
this patch does that.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
CC: stable@kernel.org

x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.

When using the paravirt interface, most of the page operations are wrapped
in the pvops interface. The one that is not is the pte_flags. The reason
being that for most cases, the "raw" PTE flag values for baremetal and whatever
pvops platform is running (in this case) - share the same bit meaning.

Except for PAT. Under Linux, the PAT MSR is written to be:

          PAT4                 PAT0
+---+----+----+----+-----+----+----+
WC | WC | WB | UC | UC- | WC | WB |  <= Linux
+---+----+----+----+-----+----+----+
WC | WT | WB | UC | UC- | WT | WB |  <= BIOS
+---+----+----+----+-----+----+----+
WC | WP | WC | UC | UC- | WT | WB |  <= Xen
+---+----+----+----+-----+----+----+

The lookup of this index table translates to looking up
Bit 7, Bit 4, and Bit 3 of PTE:

PAT/PSE (bit 7) ... PCD (bit 4) .. PWT (bit 3).

If all bits are off, then we are using PAT0. If bit 3 turned on,
then we are using PAT1, if bit 3 and bit 4, then PAT2..

Back to the PAT MSR table:

As you can see, the PAT1 translates to PAT4 under Xen. Under Linux
we only use PAT0, PAT1, and PAT2 for the caching as:

WB = none (so PAT0)
WC = PWT (bit 3 on)
UC = PWT | PCD (bit 3 and 4 are on).

But to make it work with Xen, we end up doing for WC a translation:

PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3

And to translate back (when the paravirt pte_val is used) we would:

PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7.

This works quite well, except if code uses the pte_flags, as pte_flags
reads the raw value and does not go through the paravirt. Which means
that if (when running under Xen):

1) we allocate some pages.
2) call set_pages_array_wc, which ends up calling:
     __page_change_att_set_clr(.., __pgprot(__PAGE_WC),  /* set */
                                 , __pgprot(__PAGE_MASK), /* clear */
    which ends up reading the _raw_ PTE flags and _only_ look at the
    _PTE_FLAG_MASK contents with __PAGE_MASK cleared (0x18) and
    __PAGE_WC (0x8) set.

     read raw *pte -> 0x67
     *pte = 0x67 & ^0x18 | 0x8
     *pte = 0x67 & 0xfffffe7 | 0x8
     *pte = 0x6f

   [now set_pte_atomic is called, and 0x6f is written in, but under
    xen_make_pte, the bit 3 is translated to bit 7, so it ends up
    writting 0xa7, which is correct]

3) do something to them.
4) call set_pages_array_wb
     __page_change_att_set_clr(.., __pgprot(__PAGE_WB),  /* set */
                                 , __pgprot(__PAGE_MASK), /* clear */
    which ends up reading the _raw_ PTE and _only_ look at the
    _PTE_FLAG_MASK contents with _PAGE_MASK cleared (0x18) and
    __PAGE_WB (0x0) set:

     read raw *pte -> 0xa7
     *pte = 0xa7 & &0x18 | 0
     *pte = 0xa7 & 0xfffffe7 | 0
     *pte = 0xa7

   [we check whether the old PTE is different from the new one

    if (pte_val(old_pte) != pte_val(new_pte)) {
        set_pte_atomic(kpte, new_pte);
        ...

   and find out that 0xA7 == 0xA7 so we do not write the new PTE value in]

   End result is that we failed at removing the WC caching bit!

5) free them.
   [and have pages with PAT4 (bit 7) set, so other subsystems end up using
    the pages that have the write combined bit set resulting in crashes. Yikes!].

The fix, which this patch proposes, is to wrap the pte_pgprot in the CPA
code with newly introduced pte_attrs which can go through the pvops interface
to get the "emulated" value instead of the raw. Naturally if CONFIG_PARAVIRT is
not set, it would end calling native_pte_val.

The other way to fix this is by wrapping pte_flags and go through the pvops
interface and it really is the Right Thing to do.  The problem is, that past
experience with mprotect stuff demonstrates that it be really expensive in inner
loops, and pte_flags() is used in some very perf-critical areas.

Example code to run this and see the various mysterious subsystems/applications
crashing

MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
MODULE_DESCRIPTION("wb_to_wc_and_back");
MODULE_LICENSE("GPL");
MODULE_VERSION(WB_TO_WC);

static int thread(void *arg)
{
struct page *a[MAX_PAGES];
unsigned int i, j;
do {
for (j = 0, i = 0;i < MAX_PAGES; i++, j++) {
a[i] = alloc_page(GFP_KERNEL);
if (!a[i])
break;
}
set_pages_array_wc(a, j);
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout_interruptible(HZ);
for (i = 0; i < j; i++) {
unsigned long *addr = page_address(a[i]);
if (addr) {
memset(addr, 0xc2, PAGE_SIZE);
}
}
set_pages_array_wb(a, j);
for (i = 0; i< MAX_PAGES; i++) {
if (a[i])
__free_page(a[i]);
a[i] = NULL;
}
} while (!kthread_should_stop());
return 0;
}
static struct task_struct *t;
static int __init wb_to_wc_init(void)
{
t = kthread_run(thread, NULL, "wb_to_wc_and_back");
return 0;
}
static void __exit wb_to_wc_exit(void)
{
if (t)
kthread_stop(t);
}
module_init(wb_to_wc_init);
module_exit(wb_to_wc_exit);

This fixes RH BZ #742032
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Tom Goetz <tom.goetz@virtualcomputer.com>
CC: stable@kernel.org

xen/pm_idle: Make pm_idle be default_idle under Xen.

The idea behind commit d91ee5863b71 ("cpuidle: replace xen access to x86
pm_idle and default_idle") was to have one call - disable_cpuidle()
which would make pm_idle not be molested by other code.  It disallows
cpuidle_idle_call to be set to pm_idle (which is excellent).

But in the select_idle_routine() and idle_setup(), the pm_idle can still
be set to either: amd_e400_idle, mwait_idle or default_idle.  This
depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.

In case of mwait_idle we can hit some instances where the hypervisor
(Amazon EC2 specifically) sets the MWAIT and we get:

  Brought up 2 CPUs
  invalid opcode: 0000 [#1] SMP

  Pid: 0, comm: swapper Not tainted 3.1.0-0.rc6.git0.3.fc16.x86_64 #1
  RIP: e030:[<ffffffff81015d1d>]  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
  ...
  Call Trace:
   [<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
   [<ffffffff8149ee78>] cpu_bringup_and_idle+0xe/0x10
  RIP  [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
   RSP <ffff8801d28ddf10>

In the case of amd_e400_idle we don't get so spectacular crashes, but we
do end up making an MSR which is trapped in the hypervisor, and then
follow it up with a yield hypercall.  Meaning we end up going to
hypervisor twice instead of just once.

The previous behavior before v3.0 was that pm_idle was set to
default_idle regardless of select_idle_routine/idle_setup.

We want to do that, but only for one specific case: Xen.  This patch
does that.

Fixes RH BZ #739499 and Ubuntu #881076
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

xen: map foreign pages for shared rings by updating the PTEs directly

When mapping a foreign page with xenbus_map_ring_valloc() with the
GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
pass a pointer to the PTE (in init_mm).

After the page is mapped, the usual fault mechanism can be used to
update additional MMs. This allows the vmalloc_sync_all() to be
removed from alloc_vm_area().

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
[v1: Squashed fix by Michal for no-mmu case]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

net: xen-netback: use API provided by xenbus module to map rings

The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the Tx and Rx ring pages
granted by the frontend.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

block: xen-blkback: use API provided by xenbus module to map rings

The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the ring pages granted by
the frontend.

Acked-by: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: use generic functions instead of xen_{alloc, free}_vm_area()

Replace calls to the Xen-specific xen_alloc_vm_area() and
xen_free_vm_area() functions with the generic equivalent
(alloc_vm_area() and free_vm_area()).

On x86, these were identical already.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

block: xen-blkback: use API provided by xenbus module to map rings

The xenbus module provides xenbus_map_ring_valloc() and
xenbus_map_ring_vfree(). Use these to map the ring pages granted by
the frontend.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkback: convert hole punching to discard request on loop devices

As of dfaa2ef68e80c378e610e3c8c536f1c239e8d3ef, loop devices support
discard request now. We could just issue a discard request, and
the loop driver will punch the hole for us, so we don't need to touch
the internals of loop device and punch the hole ourselves, Thanks.

V0->V1: rebased on devel/for-jens-3.3

Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Move processing of BLKIF_OP_DISCARD from dispatch_rw_block_io

.. and move it to its own function that will deal with the
discard operation.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blk[front|back]: Enhance discard support with secure erasing support.

Part of the blkdev_issue_discard(xx) operation is that it can also
issue a secure discard operation that will permanantly remove the
sectors in question. We advertise that we can support that via the
'discard-secure' attribute and on the request, if the 'secure' bit
is set, we will attempt to pass in REQ_DISCARD | REQ_SECURE.

CC: Li Dongyang <lidongyang@novell.com>
[v1: Used 'flag' instead of 'secure:1' bit]
[v2: Use 'reserved' uint8_t instead of adding a new value]
[v3: Check for nseg when mapping instead of operation]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together

In a union type structure to deal with the overlapping
attributes in a easier manner.

Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branches 'stable/mmu.fixes.rebased' and 'stable/bug.fixes-3.2.rebased' into uek2-merge

* stable/mmu.fixes.rebased:
  xen-gntalloc: signedness bug in add_grefs()
  xen-gntalloc: integer overflow in gntalloc_ioctl_alloc()
  xen-gntdev: integer overflow in gntdev_alloc_map()
  xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs.
  xen/balloon: Avoid OOM when requesting highmem

* stable/bug.fixes-3.2.rebased:
  xen: Remove hanging references to CONFIG_XEN_PLATFORM_PCI

xen-gntalloc: signedness bug in add_grefs()

gref->gref_id is unsigned so the error handling didn't work.
gnttab_grant_foreign_access() returns an int type, so we can add a
cast here, and it doesn't cause any problems.
gnttab_grant_foreign_access() can return a variety of errors
including -ENOSPC, -ENOSYS and -ENOMEM.

CC: stable@kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-gntalloc: integer overflow in gntalloc_ioctl_alloc()

On 32 bit systems a high value of op.count could lead to an integer
overflow in the kzalloc() and gref_ids would be smaller than
expected. If the you triggered another integer overflow in
"if (gref_size + op.count > limit)" then you'd probably get memory
corruption inside add_grefs().

CC: stable@kernel.org
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-gntdev: integer overflow in gntdev_alloc_map()

The multiplications here can overflow resulting in smaller buffer
sizes than expected. "count" comes from a copy_from_user().

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs.

PVHVM running with more than 32 vcpus and pv_irq/pv_time enabled
need VCPU placement to work, or else it will softlockup.

CC: stable@kernel.org
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/balloon: Avoid OOM when requesting highmem

If highmem pages are requested from the balloon on a system without
highmem, the implementation of alloc_xenballooned_pages will allocate
all available memory trying to find highmem pages to return. Allow
low memory to be returned when highmem pages are requested to avoid
this loop.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: Remove hanging references to CONFIG_XEN_PLATFORM_PCI

In 5fbdc10395cd500d6ff844825a918c4e6f38de37 the XEN_PLATFORM_PCI config
option was removed, but references in header files remained. Clean up
those references.

Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/microcode: check proper return code.

After pulling in this change from your tree, I found the following bug,
when checking an enum value, which should be considered before inclusion:

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch 'upstream/microcode' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen into stable/misc

* 'upstream/microcode' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
  xen: add CPU microcode update driver
  xen: add dom0_op hypercall
  xen/acpi: Domain0 acpi parser related platform hypercall

xen/v86d: Fix /dev/mem to access memory below 1MB

We need to provide the VM_IO (and _PAGE_IOMAP) to see the contents
of the memory below 1MB.

Reported-and-Tested-by: Ben Guthro <ben.guthro@virtualcomputer.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch 'stable/xen-block.rebase' into uek2-merge

* stable/xen-block.rebase:
  xen/blkback: Fix two races in the handling of barrier requests.
  xen/blkback: Check for proper operation.
  xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
  xen/blkback: Report VBD_WSECT (wr_sect) properly.
  xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
  xen-blkfront: plug device number leak in xlblk_init() error path
  xen-blkfront: If no barrier or flush is supported, use invalid operation.
  xen-blkback: use kzalloc() in favor of kmalloc()+memset()
  xen-blkback: fixed indentation and comments
  xen-blkfront: fix a deadlock while handling discard response
  xen-blkfront: Handle discard requests.
  xen-blkback: Implement discard requests ('feature-discard')
  xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
  xen/blkback: Add module alias for autoloading
  xen/blkback: Don't let in-flight requests defer pending ones.

Conflicts:
drivers/block/xen-blkback/blkback.c

xen/blkback: Fix two races in the handling of barrier requests.

There are two windows of opportunity to cause a race when
processing a barrier request. This patch fixes this.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Check for proper operation.

The patch titled: "xen/blkback: Fix the inhibition to map pages
when discarding sector ranges." had the right idea except that
it used the wrong comparison operator. It had == instead of !=.

This fixes the bug where all (except discard) operations would
have been ignored.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Fix the inhibition to map pages when discarding sector ranges.

The 'operation' parameters are the ones provided to the bio layer while
the req->operation are the ones passed in between the backend and
frontend. We used the wrong 'operation' value to squash the
call to map pages when processing the discard operation resulting
in an hypercall that did nothing. Lets guard against going in the
mapping function by checking for the proper operation type.

CC: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Report VBD_WSECT (wr_sect) properly.

We did not increment the amount of sectors written to disk
b/c we tested for the == WRITE which is incorrect - as the
operations are more of WRITE_FLUSH, WRITE_ODIRECT. This patch
fixes it by doing a & WRITE check.

CC: stable@kernel.org
Reported-by: Andy Burns <xen.lists@burns.me.uk>
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.

We emulate the barrier requests by draining the outstanding bio's
and then sending the WRITE_FLUSH command. To drain the I/Os
we use the refcnt that is used during disconnect to wait for all
the I/Os before disconnecting from the frontend. We latch on its
value and if it reaches either the threshold for disconnect or when
there are no more outstanding I/Os, then we have drained all I/Os.

Suggested-by: Christopher Hellwig <hch@infradead.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkfront: plug device number leak in xlblk_init() error path

... though after a failed xenbus_register_frontend() all may be lost.

Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkfront: If no barrier or flush is supported, use invalid operation.

Guard against issuing BLKIF_OP_WRITE_BARRIER or BLKIF_OP_FLUSH_CACHE
by checking whether we successfully negotiated with the backend.
The negotiation with the backend also sets the q->flush_flags which
fortunately for us is also used when submitting an bio to us. If
we don't support barriers or flushes it would be set to zero so
we should never end up having to deal with REQ_FLUSH | REQ_FUA.

However, other third party implementations of __make_request that
might be stacked on top of us might not be so smart, so lets fix this up.

Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkback: use kzalloc() in favor of kmalloc()+memset()

This fixes the problem of three of those four memset()-s having
improper size arguments passed: Sizeof a pointer-typed expression
returns the size of the pointer, not that of the pointed to data.

It also reverts using kmalloc() instead of kzalloc() for the allocation
of the pending grant handles array, as that array gets fully
initialized in a subsequent loop.

Reported-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkback: fixed indentation and comments

This patch fixes belows:

1. Fix code style issue.
2. Fix incorrect functions name in comments.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: Ian Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkfront: fix a deadlock while handling discard response

When we get -EOPNOTSUPP response for a discard request, we will clear
the discard flag on the request queue so we won't attempt to send discard
requests to backend again, and this should be protected under rq->queue_lock.
However, when we setup the request queue, we pass blkif_io_lock to
blk_init_queue so rq->queue_lock is blkif_io_lock indeed, and this lock
is already taken when we are in blkif_interrpt, so remove the
spin_lock/spin_unlock when we clear the discard flag or we will end up
with deadlock here

Signed-off-by: Li Dongyang <lidongyang@novell.com>
[v1: Updated description a bit and removed comment from source]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkfront: Handle discard requests.

If the backend advertises 'feature-discard', then interrogate
the backend for alignment and granularity. Setup the request
queue with the appropiate values and send the discard operation
as required.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
[v1: Amended commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkback: Implement discard requests ('feature-discard')

..aka ATA TRIM/SCSI UNMAP command to be passed through the frontend
and used as appropiately by the backend. We also advertise
certain granulity parameters to the frontend so it can plug them in.
If the backend is a realy device - we just end up using
'blkdev_issue_discard' while for loopback devices - we just punch
a hole in the image file.

Signed-off-by: Li Dongyang <lidongyang@novell.com>
[v1: Fixed up pr_debug and commit description]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen-blkfront: add BLKIF_OP_DISCARD and discard request struct

Now we use BLKIF_OP_DISCARD and add blkif_request_discard to blkif_request union,
the patch is taken from Owen Smith and Konrad, Thanks

Signed-off-by: Owen Smith <owen.smith@citrix.com>
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Add module alias for autoloading

Add xen-backend:vbd module alias to the xen-blkback module. This allows
automatic loading of the module.

Signed-off-by: Bastian Blank <waldi@debian.org>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/blkback: Don't let in-flight requests defer pending ones.

Running RING_FINAL_CHECK_FOR_REQUESTS from make_response is a bad
idea. It means that in-flight I/O is essentially blocking continued
batches. This essentially kills throughput on frontends which unplug
(or even just notify) early and rightfully assume addtional requests
will be picked up on time, not synchronously.

Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com>
[v1: Rebased and fixed compile problems]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch 'stable/xen-settime' into uek2-merge

* stable/xen-settime:
  xen/dom0: set wallclock time in Xen
  xen: add dom0_op hypercall
  xen/acpi: Domain0 acpi parser related platform hypercall

Merge branch 'stable/e820-3.2.rebased' into uek2-merge

* stable/e820-3.2.rebased:
  xen: release all pages within 1-1 p2m mappings
  xen: allow extra memory to be in multiple regions
  xen: allow balloon driver to use more than one memory region
  xen/balloon: simplify test for the end of usable RAM
  xen/balloon: account for pages released during memory setup
  xen/e820: if there is no dom0_mem=, don't tweak extra_pages.
  Revert "xen/e820: if there is no dom0_mem=, don't tweak extra_pages."
  xen/e820: if there is no dom0_mem=, don't tweak extra_pages.
  xen: use maximum reservation to limit amount of usable RAM
  xen: Fix misleading WARN message at xen_release_chunk
  xen: Fix printk() format in xen/setup.c

Conflicts:
arch/x86/xen/setup.c

Merge branch 'stable/mmu.fixes.rebased' into uek2-merge

* stable/mmu.fixes.rebased:
  xen/gntdev: Fix sleep-inside-spinlock
  xen: modify kernel mappings corresponding to granted pages
  xen: add an "highmem" parameter to alloc_xenballooned_pages
  xen/p2m: Use SetPagePrivate and its friends for M2P overrides.
  xen/p2m: Make debug/xen/mmu/p2m visible again.
  Revert "xen/debug: WARN_ON when identity PFN has no _PAGE_IOMAP flag set."

Conflicts:
drivers/xen/balloon.c
include/xen/balloon.h

Merge branch 'stable/drivers-3.2.rebased' into uek2-merge

* stable/drivers-3.2.rebased:
  xen: use static initializers in xen-balloon.c
  xenbus: don't rely on xen_initial_domain to detect local xenstore
  xenbus: Fix loopback event channel assuming domain 0
  xen/pv-on-hvm:kexec: Fix implicit declaration of function 'xen_hvm_domain'
  xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel
  xen/pv-on-hvm kexec: update xs_wire.h:xsd_sockmsg_type from xen-unstable
  xen/pv-on-hvm kexec+kdump: reset PV devices in kexec or crash kernel
  xen/pv-on-hvm kexec: rebind virqs to existing eventchannel ports
  xen/pv-on-hvm kexec: prevent crash in xenwatch_thread() when stale watch events arrive

Conflicts:
drivers/xen/xen-balloon.c
drivers/xen/xenbus/xenbus_probe.c

Merge branch 'stable/cleanups-3.2.rebased' into uek2-merge

* stable/cleanups-3.2.rebased:
  Xen: fix braces and tabs coding style issue in xenbus_probe.c
  Xen: fix braces coding style issue in xenbus_probe.h
  Xen: fix whitespaces,tabs coding style issue in drivers/xen/pci.c
  Xen: fix braces coding style issue in gntdev.c and grant-table.c
  Xen: fix whitespaces,tabs coding style issue in drivers/xen/events.c
  Xen: fix whitespaces,tabs coding style issue in drivers/xen/balloon.c

Conflicts:
drivers/xen/pci.c

Merge branch 'stable/pci.fixes-3.2' of git://oss.oracle.com/git/kwilk/xen into uek2-merge

* 'stable/pci.fixes-3.2' of git://oss.oracle.com/git/kwilk/xen:
  xen/pci: support multi-segment systems
  xen-swiotlb: When doing coherent alloc/dealloc check before swizzling the MFNs.
  xen/pci: make bus notifier handler return sane values
  xen-swiotlb: fix printk and panic args
  xen-swiotlb: Fix wrong panic.
  xen-swiotlb: Retry up three times to allocate Xen-SWIOTLB
  xen-pcifront: Update warning comment to use 'e820_host' option.

Merge branch 'stable/bug.fixes-3.2.rebased' of git://oss.oracle.com/git/kwilk/xen into uek2-merge

* 'stable/bug.fixes-3.2.rebased' of git://oss.oracle.com/git/kwilk/xen:
  xen/irq: If we fail during msi_capability_init return proper error code.
  xen: remove XEN_PLATFORM_PCI config option
  xen: XEN_PVHVM depends on PCI
  xen/p2m/debugfs: Make type_name more obvious.
  xen/p2m/debugfs: Fix potential pointer exception.
  xen/enlighten: Fix compile warnings and set cx to known value.
  xen/xenbus: Remove the unnecessary check.
  xen/events: Don't check the info for NULL as it is already done.
  xen/pci: Use 'acpi_gsi_to_irq' value unconditionally.
  xen/pci: Remove 'xen_allocate_pirq_gsi'.
  xen/pci: Retire unnecessary #ifdef CONFIG_ACPI
  xen/pci: Move the allocation of IRQs when there are no IOAPIC's to the end
  xen/pci: Squash pci_xen_initial_domain and xen_setup_pirqs together.
  xen/pci: Use the xen_register_pirq for HVM and initial domain users
  xen/pci: In xen_register_pirq bind the GSI to the IRQ after the hypercall.
  xen/pci: Provide #ifdef CONFIG_ACPI to easy code squashing.
  xen/pci: Update comments and fix empty spaces.
  xen/pci: Shuffle code around.

Conflicts:
arch/x86/pci/xen.c
drivers/xen/Makefile

Merge branch 'stable/xen-pciback-0.6.3.bugfixes' of git://oss.oracle.com/git/kwilk/xen into uek2-merge

* 'stable/xen-pciback-0.6.3.bugfixes' of git://oss.oracle.com/git/kwilk/xen:
  xen/pciback: Check if the device is found instead of blindly assuming so.
  xen/pciback: Do not dereference psdev during printk when it is NULL.
  xen/pciback: double lock typo
  xen/pciback: use mutex rather than spinlock in vpci backend
  xen/pciback: Use mutexes when working with Xenbus state transitions.
  xen/pciback: miscellaneous adjustments
  xen/pciback: use mutex rather than spinlock in passthrough backend
  xen/pciback: use resource_size()

xen/irq: If we fail during msi_capability_init return proper error code.

There are three different modes: PV, HVM, and initial domain 0. In all
the cases we would return -1 for failure instead of a proper error code.
Fix this by propagating the error code from the generic IRQ code.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: remove XEN_PLATFORM_PCI config option

Xen PVHVM needs xen-platform-pci, on the other hand xen-platform-pci is
useless in any other cases.
Therefore remove the XEN_PLATFORM_PCI config option and compile
xen-platform-pci built-in if XEN_PVHVM is selected.

Changes to v1:

- remove xen-platform-pci.o and just use platform-pci.o since it is not
externally visible anymore.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:

drivers/xen/Makefile

xen: XEN_PVHVM depends on PCI

Xen PV on HVM guests require PCI support because they need the
xen-platform-pci driver in order to initialize xenbus.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/p2m/debugfs: Make type_name more obvious.

Per Ian Campbell suggestion to defend against future breakage
in case we expand the P2M values, incorporate the defines
in the string array.

Suggested-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/p2m/debugfs: Fix potential pointer exception.

We could be referencing the last + 1 element of level_name[]
array which would cause a pointer exception, because of the
initial setup of lvl=4.

[v1: No need to do this for type_name, pointed out by Ian Campbell]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/enlighten: Fix compile warnings and set cx to known value.

We get:
linux/arch/x86/xen/enlighten.c: In function ‘xen_start_kernel’:
linux/arch/x86/xen/enlighten.c:226: warning: ‘cx’ may be used uninitialized in this function
linux/arch/x86/xen/enlighten.c:240: note: ‘cx’ was declared here

and the cx is really not set but passed in the xen_cpuid instruction
which masks the value with returned masked_ecx from cpuid. This
can potentially lead to invalid data being stored in cx.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/xenbus: Remove the unnecessary check.

.. we check whether 'xdev' is NULL - but there is no need for
it as the 'dev' check is done before. The 'dev' is embedded in
the 'xdev' so having xdev != NULL with dev being being checked
is not going to happen.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/events: Don't check the info for NULL as it is already done.

The list operation checks whether the 'info' structure that is
retrieved from the list is NULL (otherwise it would not been able
to retrieve it). This check is not neccessary.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Use 'acpi_gsi_to_irq' value unconditionally.

In the past we would only use the function's value if the
returned value was not equal to 'acpi_sci_override_gsi'. Meaning
that the INT_SRV_OVR values for global and source irq were different.
But it is OK to use the function's value even when the global
and source irq are the same.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Remove 'xen_allocate_pirq_gsi'.

In the past (2.6.38) the 'xen_allocate_pirq_gsi' would allocate
an entry in a Linux IRQ -> {XEN_IRQ, type, event, ..} array. All
of that has been removed in 2.6.39 and the Xen IRQ subsystem uses
an linked list that is populated when the call to
'xen_allocate_irq_gsi' (universally done from any of the xen_bind_*
calls) is done. The 'xen_allocate_pirq_gsi' is a NOP and there is
no need for it anymore so lets remove it.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Retire unnecessary #ifdef CONFIG_ACPI

As the code paths that are guarded by CONFIG_XEN_DOM0 already depend
on CONFIG_ACPI so the extra #ifdef is not required. The earlier
patch that added them in had done its job.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Move the allocation of IRQs when there are no IOAPIC's to the end

.. which means we can preset of NR_IRQS_LEGACY interrupts using
the 'acpi_get_override_irq' API before this loop.
This means that we can get the IRQ's polarity (and trigger) from either
the ACPI (or MP); or use the default values. This fixes a bug if we did
not have an IOAPIC we would not been able to preset the IRQ's polarity
if the MP table existed.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Squash pci_xen_initial_domain and xen_setup_pirqs together.

Since they are only called once and the rest of the pci_xen_*
functions follow the same pattern of setup.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Use the xen_register_pirq for HVM and initial domain users

.. to cut down on the code duplicity.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: In xen_register_pirq bind the GSI to the IRQ after the hypercall.

Not before .. also that code segment starts looking like the HVM one.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Provide #ifdef CONFIG_ACPI to easy code squashing.

In the past we would guard those code segments to be dependent
on CONFIG_XEN_DOM0 (which depends on CONFIG_ACPI) so this patch is
not stricly necessary. But the next patch will merge common
HVM and initial domain code and we want to make sure the CONFIG_ACPI
dependency is preserved - as HVM code does not depend on CONFIG_XEN_DOM0.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Update comments and fix empty spaces.

Update the out-dated comment at the beginning of the file.
Also provide the copyrights of folks who have been contributing
to this code lately.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pci: Shuffle code around.

The file is hard to read. Move the code around so that
the contents of it follows a uniform format:
- setup GSIs - PV, HVM, and initial domain case
- then MSI/MSI-x setup - PV, HVM and then initial domain case.
- then MSI/MSI-x teardown - same order.
- lastly, the __init functions in PV, HVM, and initial domain order.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/dom0: set wallclock time in Xen

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

xen: add dom0_op hypercall

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>

xen/acpi: Domain0 acpi parser related platform hypercall

This patches implements the xen_platform_op hypercall, to pass the parsed
ACPI info to hypervisor.

Signed-off-by: Yu Ke <ke.yu@intel.com>
Signed-off-by: Tian Kevin <kevin.tian@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Added DEFINE_GUEST.. in appropiate headers]
[v2: Ripped out typedefs]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: release all pages within 1-1 p2m mappings

In xen_memory_setup() all reserved regions and gaps are set to an
identity (1-1) p2m mapping. If an available page has a PFN within one
of these 1-1 mappings it will become inaccessible (as it MFN is lost)
so release them before setting up the mapping.

This can make an additional 256 MiB or more of RAM available
(depending on the size of the reserved regions in the memory map) if
the initial pages overlap with reserved regions.

The 1:1 p2m mappings are also extended to cover partial pages. This
fixes an issue with (for example) systems with a BIOS that puts the
DMI tables in a reserved region that begins on a non-page boundary.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: allow extra memory to be in multiple regions

Allow the extra memory (used by the balloon driver) to be in multiple
regions (typically two regions, one for low memory and one for high
memory). This allows the balloon driver to increase the number of
available low pages (if the initial number if pages is small).

As a side effect, the algorithm for building the e820 memory map is
simpler and more obviously correct as the map supplied by the
hypervisor is (almost) used as is (in particular, all reserved regions
and gaps are preserved). Only RAM regions are altered and RAM regions
above max_pfn + extra_pages are marked as unused (the region is split
in two if necessary).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: allow balloon driver to use more than one memory region

Allow the xen balloon driver to populate its list of extra pages from
more than one region of memory. This will allow platforms to provide
(for example) a region of low memory and a region of high memory.

The maximum possible number of extra regions is 128 (== E820MAX) which
is quite large so xen_extra_mem is placed in __initdata. This is safe
as both xen_memory_setup() and balloon_init() are in __init.

The balloon regions themselves are not altered (i.e., there is still
only the one region).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/balloon: simplify test for the end of usable RAM

When initializing the balloon only max_pfn needs to be checked
(max_pfn will always be <= e820_end_of_ram_pfn()) and improve the
confusing comment.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/balloon: account for pages released during memory setup

In xen_memory_setup() pages that occur in gaps in the memory map are
released back to Xen.  This reduces the domain's current page count in
the hypervisor.  The Xen balloon driver does not correctly decrease
its initial current_pages count to reflect this.  If 'delta' pages are
released and the target is adjusted the resulting reservation is
always 'delta' less than the requested target.

This affects dom0 if the initial allocation of pages overlaps the PCI
memory region but won't affect most domU guests that have been setup
with pseudo-physical memory maps that don't have gaps.

Fix this by accouting for the released pages when starting the balloon
driver.

If the domain's targets are managed by xapi, the domain may eventually
run out of memory and die because xapi currently gets its target
calculations wrong and whenever it is restarted it always reduces the
target by 'delta'.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/e820: if there is no dom0_mem=, don't tweak extra_pages.

The patch "xen: use maximum reservation to limit amount of usable RAM"
(d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11) breaks machines that
do not use 'dom0_mem=' argument with:

reserve RAM buffer: 000000133f2e2000 - 000000133fffffff
(XEN) mm.c:4976:d0 Global bit is set to kernel page fffff8117e
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
...

The reason being that the last E820 entry is created using the
'extra_pages' (which is based on how many pages have been freed).
The mentioned git commit sets the initial value of 'extra_pages'
using a hypercall which returns the number of pages (if dom0_mem
has been used) or -1 otherwise. If the later we return with
MAX_DOMAIN_PAGES as basis for calculation:

    return min(max_pages, MAX_DOMAIN_PAGES);

and use it:

     extra_limit = xen_get_max_pages();
     if (extra_limit >= max_pfn)
             extra_pages = extra_limit - max_pfn;
     else
             extra_pages = 0;

which means we end up with extra_pages = 128GB in PFNs (33554432)
- 8GB in PFNs (2097152, on this specific box, can be larger or smaller),
and then we add that value to the E820 making it:

  Xen: 00000000ff000000 - 0000000100000000 (reserved)
  Xen: 0000000100000000 - 000000133f2e2000 (usable)

which is clearly wrong. It should look as so:

  Xen: 00000000ff000000 - 0000000100000000 (reserved)
  Xen: 0000000100000000 - 000000027fbda000 (usable)

Naturally this problem does not present itself if dom0_mem=max:X
is used.

CC: stable@kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Revert "xen/e820: if there is no dom0_mem=, don't tweak extra_pages."

This reverts commit 38ec5d3381179924085ac3b21caa8f27da9ba8b3.

We will use an version that went upstream.

xen/e820: if there is no dom0_mem=, don't tweak extra_pages.

The patch "xen: use maximum reservation to limit amount of usable RAM"
(d312ae878b6aed3912e1acaaf5d0b2a9d08a4f11) breaks machines that
do not use 'dom0_mem=' argument with:

reserve RAM buffer: 000000133f2e2000 - 000000133fffffff
(XEN) mm.c:4976:d0 Global bit is set to kernel page fffff8117e
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
...

The reason being that the last E820 entry is created using the
'extra_pages' (which is based on how many pages have been freed).
The mentioned git commit sets the initial value of 'extra_pages'
using a hypercall which returns the number of pages (if dom0_mem
has been used) or -1 otherwise. If the later we return with
MAX_DOMAIN_PAGES as basis for calculation:

    return min(max_pages, MAX_DOMAIN_PAGES);

and use it:

     extra_limit = xen_get_max_pages();
     if (extra_limit >= max_pfn)
             extra_pages = extra_limit - max_pfn;
     else
             extra_pages = 0;

which means we end up with extra_pages = 128GB in PFNs (33554432)
- 8GB in PFNs (2097152, on this specific box, can be larger or smaller),
and then we add that value to the E820 making it:

  Xen: 00000000ff000000 - 0000000100000000 (reserved)
  Xen: 0000000100000000 - 000000133f2e2000 (usable)

which is clearly wrong. It should look as so:

  Xen: 00000000ff000000 - 0000000100000000 (reserved)
  Xen: 0000000100000000 - 000000027fbda000 (usable)

Naturally this problem does not present itself if dom0_mem=max:X
is used.

CC: stable@kernel.org
CC: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: use maximum reservation to limit amount of usable RAM

Use the domain's maximum reservation to limit the amount of extra RAM
for the memory balloon. This reduces the size of the pages tables and
the amount of reserved low memory (which defaults to about 1/32 of the
total RAM).

On a system with 8 GiB of RAM with the domain limited to 1 GiB the
kernel reports:

Before:

Memory: 627792k/4472000k available

After:

Memory: 549740k/11132224k available

A increase of about 76 MiB (~1.5% of the unused 7 GiB). The reserved
low memory is also reduced from 253 MiB to 32 MiB. The total
additional usable RAM is 329 MiB.

For dom0, this requires at patch to Xen ('x86: use 'dom0_mem' to limit
the number of pages for dom0') (c/s 23790)

CC: stable@kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: Fix misleading WARN message at xen_release_chunk

WARN message should not complain
"Failed to release memory %lx-%lx err=%d\n"
^^^^^^^
about range when it fails to release just one page,
instead it should say what pfn is not freed.

In addition line:
printk(KERN_INFO "xen_release_chunk: looking at area pfn %lx-%lx: "
...
printk(KERN_CONT "%lu pages freed\n", len);
will be broken if WARN in between this line is fired. So fix it
by using a single printk for this.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: Fix printk() format in xen/setup.c

Use correct format specifier for unsigned long.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/gntdev: Fix sleep-inside-spinlock

BUG: sleeping function called from invalid context at /local/scratch/dariof/linux/kernel/mutex.c:271
in_atomic(): 1, irqs_disabled(): 0, pid: 3256, name: qemu-dm
1 lock held by qemu-dm/3256:
#0: (&(&priv->lock)->rlock){......}, at: [<ffffffff813223da>] gntdev_ioctl+0x2bd/0x4d5
Pid: 3256, comm: qemu-dm Tainted: G W 3.1.0-rc8+ #5
Call Trace:
[<ffffffff81054594>] __might_sleep+0x131/0x135
[<ffffffff816bd64f>] mutex_lock_nested+0x25/0x45
[<ffffffff8131c7c8>] free_xenballooned_pages+0x20/0xb1
[<ffffffff8132194d>] gntdev_put_map+0xa8/0xdb
[<ffffffff816be546>] ? _raw_spin_lock+0x71/0x7a
[<ffffffff813223da>] ? gntdev_ioctl+0x2bd/0x4d5
[<ffffffff8132243c>] gntdev_ioctl+0x31f/0x4d5
[<ffffffff81007d62>] ? check_events+0x12/0x20
[<ffffffff811433bc>] do_vfs_ioctl+0x488/0x4d7
[<ffffffff81007d4f>] ? xen_restore_fl_direct_reloc+0x4/0x4
[<ffffffff8109168b>] ? lock_release+0x21c/0x229
[<ffffffff81135cdd>] ? rcu_read_unlock+0x21/0x32
[<ffffffff81143452>] sys_ioctl+0x47/0x6a
[<ffffffff816bfd82>] system_call_fastpath+0x16/0x1b

gntdev_put_map tries to acquire a mutex when freeing pages back to the
xenballoon pool, so it cannot be called with a spinlock held. In
gntdev_release, the spinlock is not needed as we are freeing the
structure later; in the ioctl, only the list manipulation needs to be
under the lock.

Reported-and-Tested-By: Dario Faggioli <dario.faggioli@citrix.com>
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: modify kernel mappings corresponding to granted pages

If we want to use granted pages for AIO, changing the mappings of a user
vma and the corresponding p2m is not enough, we also need to update the
kernel mappings accordingly.
Currently this is only needed for pages that are created for user usages
through /dev/xen/gntdev. As in, pages that have been in use by the
kernel and use the P2M will not need this special mapping.
However there are no guarantees that in the future the kernel won't
start accessing pages through the 1:1 even for internal usage.

In order to avoid the complexity of dealing with highmem, we allocated
the pages lowmem.
We issue a HYPERVISOR_grant_table_op right away in
m2p_add_override and we remove the mappings using another
HYPERVISOR_grant_table_op in m2p_remove_override.
Considering that m2p_add_override and m2p_remove_override are called
once per page we use multicalls and hypercall batching.

Use the kmap_op pointer directly as argument to do the mapping as it is
guaranteed to be present up until the unmapping is done.
Before issuing any unmapping multicalls, we need to make sure that the
mapping has already being done, because we need the kmap->handle to be
set correctly.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v1: Removed GRANT_FRAME_BIT usage]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: add an "highmem" parameter to alloc_xenballooned_pages

Add an highmem parameter to alloc_xenballooned_pages, to allow callers to
request lowmem or highmem pages.

Fix the code style of free_xenballooned_pages' prototype.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v1: Contextual merge conflict - self-ballooning has been already merged]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/p2m: Use SetPagePrivate and its friends for M2P overrides.

We use the page->private field and hence should use the proper
macros and set proper bits. Also WARN_ON in case somebody
tries to overwrite our data.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/p2m: Make debug/xen/mmu/p2m visible again.

We dropped a lot of the MMU debugfs in favour of using
tracing API - but there is one which just provides
mostly static information that was made invisible by this change.

Bring it back.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Revert "xen/debug: WARN_ON when identity PFN has no _PAGE_IOMAP flag set."

We don' use it anymore and there are more false positives.

This reverts commit fc25151d9ac7d809239fe68de0a1490b504bb94a.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: Check if the device is found instead of blindly assuming so.

Just in case it is not found, don't try to dereference it.

[v1: Added WARN_ON, suggested by Jan Beulich <JBeulich@suse.com>]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: Do not dereference psdev during printk when it is NULL.

.. instead use BUG_ON() as all the callers of the kill_domain_by_device
check for psdev.

Suggested-by: Jan Beulich <JBeulich@suse.com>
[v1: Rebased on top " xen/pciback: miscellaneous adjustments" causes some conflicts]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: double lock typo

We called mutex_lock() twice instead of unlocking.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: use mutex rather than spinlock in vpci backend

Similar to the "xen/pciback: use mutex rather than spinlock in passthrough backend"
this patch converts the vpci backend to use a mutex instead of
a spinlock. Note that the code taking the lock won't ever get called
from non-sleepable context

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: Use mutexes when working with Xenbus state transitions.

The caller that orchestrates the state changes is xenwatch_thread
and it takes a mutex. In our processing of Xenbus states we can take
the luxery of going to sleep on a mutex, so lets do that and
also fix this bug:

BUG: sleeping function called from invalid context at /linux/kernel/mutex.c:271
in_atomic(): 1, irqs_disabled(): 0, pid: 32, name: xenwatch
2 locks held by xenwatch/32:
#0: (xenwatch_mutex){......}, at: [<ffffffff813856ab>] xenwatch_thread+0x4b/0x180
#1: (&(&pdev->dev_lock)->rlock){......}, at: [<ffffffff8138f05b>] xen_pcibk_disconnect+0x1b/0x80
Pid: 32, comm: xenwatch Not tainted 3.1.0-rc6-00015-g3ce340d #2
Call Trace:
[<ffffffff810892b2>] __might_sleep+0x102/0x130
[<ffffffff8163b90f>] mutex_lock_nested+0x2f/0x50
[<ffffffff81382c1c>] unbind_from_irq+0x2c/0x1b0
[<ffffffff8110da66>] ? free_irq+0x56/0xb0
[<ffffffff81382dbc>] unbind_from_irqhandler+0x1c/0x30
[<ffffffff8138f06b>] xen_pcibk_disconnect+0x2b/0x80
[<ffffffff81390348>] xen_pcibk_frontend_changed+0xe8/0x140
[<ffffffff81387ac2>] xenbus_otherend_changed+0xd2/0x150
[<ffffffff810895c1>] ? get_parent_ip+0x11/0x50
[<ffffffff81387de0>] frontend_changed+0x10/0x20
[<ffffffff81385712>] xenwatch_thread+0xb2/0x180

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: miscellaneous adjustments

This is a minor bugfix and a set of small cleanups; as it is not clear
whether this needs splitting into pieces (and if so, at what
granularity), it is a single combined patch.
- add a missing return statement to an error path in
kill_domain_by_device()
- use pci_is_enabled() rather than raw atomic_read()
- remove a bogus attempt to zero-terminate an already zero-terminated
string
- #define DRV_NAME once uniformly in the shared local header
- make DRIVER_ATTR() variables static
- eliminate a pointless use of list_for_each_entry_safe()
- add MODULE_ALIAS()
- a little bit of constification
- adjust a few messages
- remove stray semicolons from inline function definitions

Signed-off-by: Jan Beulich <jbeulich@suse.com>
[v1: Dropped the resource_size fix, altered the description]
[v2: Fixed cleanpatch.pl comments]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: use mutex rather than spinlock in passthrough backend

To accommodate the call to the callback function from
__xen_pcibk_publish_pci_roots(), which so far dropped and the re-
acquired the lock without checking that the list didn't actually
change, convert the code to use a mutex instead (observing that the
code taking the lock won't ever get called from non-sleepable
context).

As a result, drop the bogus use of list_for_each_entry_safe() and
remove the inappropriate dropping of the lock.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen/pciback: use resource_size()

Use resource_size function on resource object
instead of explicit computation.

The semantic patch that makes this output is available
in scripts/coccinelle/api/resource_size.cocci.

More information about semantic patching is available at
http://coccinelle.lip6.fr/

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: use static initializers in xen-balloon.c

There is no need to use dynamic initializaion, it just confuses the reader.
Switch to static initializers like its used in other files.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
[v2: Rebased on v3.0]
[v3: Rebased on v3.0 without Dan's changes to balloon driver]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Xen: fix braces and tabs coding style issue in xenbus_probe.c

This is a patch to the xenbus_probe.c file that fixed up braces and tabs errors found by the checkpatch.pl tools.

Signed-off-by: Ruslan Pisarev <ruslan@rpisarev.org.ua>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Xen: fix braces coding style issue in xenbus_probe.h

This is a patch to the xenbus_probe.h file that fixed up braces errors found by the checkpatch.pl tools.

Signed-off-by: Ruslan Pisarev <ruslan@rpisarev.org.ua>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Xen: fix whitespaces,tabs coding style issue in drivers/xen/pci.c

This is a patch to the pci.c file that fixed up whitespaces, tabs warnings found by the checkpatch.pl tools.

Signed-off-by: Ruslan Pisarev <ruslan@rpisarev.org.ua>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>