Konrad Rzeszutek Wilk [Thu, 2 Feb 2012 20:03:05 +0000 (15:03 -0500)]
Merge branch 'stable/for-linus-3.3.rebased' into uek2-merge
* stable/for-linus-3.3.rebased: (39 commits)
Merge conflict resolved. Somehow the letter 's' slipped into the Makefile. This fixes the compile issues.
xen/events: BUG() when we can't allocate our event->irq array.
xen/granttable: Disable grant v2 for HVM domains.
xen-blkfront: Use kcalloc instead of kzalloc to allocate array
xen/pciback: Expand the warning message to include domain id.
xen/pciback: Fix "device has been assigned to X domain!" warning
xen/xenbus: don't reimplement kvasprintf via a fixed size buffer
xenbus: maximum buffer size is XENSTORE_PAYLOAD_MAX
xen/xenbus: Reject replies with payload > XENSTORE_PAYLOAD_MAX.
Xen: consolidate and simplify struct xenbus_driver instantiation
xen-gntalloc: introduce missing kfree
xen/xenbus: Fix compile error - missing header for xen_initial_domain()
xen/netback: Enable netback on HVM guests
xen/grant-table: Support mappings required by blkback
xenbus: Use grant-table wrapper functions
xenbus: Support HVM backends
xen/xenbus-frontend: Fix compile error with randconfig
xen/xenbus-frontend: Make error message more clear
xen/privcmd: Remove unused support for arch specific privcmp mmap
xen: Add xenbus_backend device
...
Thomas Meyer [Tue, 29 Nov 2011 21:08:00 +0000 (22:08 +0100)]
xen-blkfront: Use kcalloc instead of kzalloc to allocate array
The advantage of kcalloc is that it prevents integer overflows which could
result from multiplying the number of elements by the element size; it is also
a bit nicer to read.
The semantic patch that makes this change is available
in https://lkml.org/lkml/2011/11/25/107
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
[v1: Separated the drivers/block/cciss_scsi.c out of this patch] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
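The overflow that kcalloc guards against can be illustrated in user space. This is a minimal sketch, not kernel code: `checked_array_alloc()` is a hypothetical helper that only mirrors the idea of checking the multiplication before allocating.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical user-space helper (not a kernel API): allocate a zeroed
 * array of n elements of the given size, refusing requests whose total
 * byte count would overflow size_t.
 */
static void *checked_array_alloc(size_t n, size_t size)
{
    void *p;

    if (size != 0 && n > SIZE_MAX / size)
        return NULL;            /* n * size would wrap around */

    p = malloc(n * size);
    if (p)
        memset(p, 0, n * size);
    return p;
}
```

With an unchecked `malloc(n * size)`, a wrapped product yields a buffer far smaller than the caller believes it has; the check turns that case into an ordinary allocation failure.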
Konrad Rzeszutek Wilk [Wed, 4 Jan 2012 19:16:45 +0000 (14:16 -0500)]
xen/pciback: Expand the warning message to include domain id.
When a PCI device is transferred to another domain while it is still
in use (from the internal perspective), mention which other
domain is using it to aid in debugging.
[v2: Truncate the verbose message per Jan Beulich suggestion]
[v3: Suggestions from Ian Campbell on the wording] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Konrad Rzeszutek Wilk [Wed, 4 Jan 2012 20:11:02 +0000 (15:11 -0500)]
xen/pciback: Fix "device has been assigned to X domain!" warning
The full warning is:
"pciback 0000:05:00.0: device has been assigned to 2 domain! Over-writting the ownership, but beware."
which is correct - the previous domain that was using the device
forgot to unregister the ownership. This patch fixes this by
calling the unregister ownership function when the PCI device is
relinquished from the guest domain.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Campbell [Wed, 4 Jan 2012 11:39:51 +0000 (11:39 +0000)]
xenbus: maximum buffer size is XENSTORE_PAYLOAD_MAX
Use this now that it is defined even though it happens to be == PAGE_SIZE.
The code which takes requests from userspace already validates against the size
of this buffer so no further checks are required to ensure that userspace
requests comply with the protocol in this respect.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Haogang Chen <haogangchen@gmail.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Campbell [Wed, 4 Jan 2012 09:34:49 +0000 (09:34 +0000)]
xen/xenbus: Reject replies with payload > XENSTORE_PAYLOAD_MAX.
Haogang Chen found out that:
There is a potential integer overflow in process_msg() that could result
in cross-domain attack.
body = kmalloc(msg->hdr.len + 1, GFP_NOIO | __GFP_HIGH);
When a malicious guest passes 0xffffffff in msg->hdr.len, the subsequent
call to xb_read() would write to a zero-length buffer.
The other end of this connection is always the xenstore backend daemon
so there is no guest (malicious or otherwise) which can do this. The
xenstore daemon is a trusted component in the system.
However, this seems like a reasonable robustness improvement so we should
have it.
And Ian, when reading the API docs, found that:
The payload length (len field of the header) is limited to 4096
(XENSTORE_PAYLOAD_MAX) in both directions. If a client exceeds the
limit, its xenstored connection will be immediately killed by
xenstored, which is usually catastrophic from the client's point of
view. Clients (particularly domains, which cannot just reconnect)
should avoid this.
so this patch checks against that instead.
This also avoids a potential integer overflow pointed out by Haogang Chen.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Haogang Chen <haogangchen@gmail.com> CC: stable@kernel.org Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
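The wrap-around described above, and the effect of checking the length against the protocol limit first, can be sketched in user space. `alloc_reply_body()` is a hypothetical stand-in for the allocation in process_msg(), not the kernel function itself; the limit value 4096 is the one quoted from the API docs.

```c
#include <stdint.h>
#include <stdlib.h>

#define XENSTORE_PAYLOAD_MAX 4096u  /* limit quoted from the API docs */

/*
 * Hypothetical stand-in for the allocation in process_msg(): reject an
 * oversized length before computing len + 1, so the addition can never
 * wrap to zero.
 */
static char *alloc_reply_body(uint32_t len)
{
    char *body;

    if (len > XENSTORE_PAYLOAD_MAX)
        return NULL;

    body = malloc((size_t)len + 1);
    if (body)
        body[len] = '\0';       /* room for the terminating NUL */
    return body;
}
```

Because the bound is far below the 32-bit maximum, the `len + 1` computation is safe for every length that passes the check.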
Jan Beulich [Thu, 22 Dec 2011 09:08:13 +0000 (09:08 +0000)]
Xen: consolidate and simplify struct xenbus_driver instantiation
The 'name', 'owner', and 'mod_name' members are redundant with the
identically named fields in the 'driver' sub-structure. Rather than
switching each instance to specify these fields explicitly, introduce
a macro to simplify this.
Eliminate further redundancy by allowing the drvname argument to
DEFINE_XENBUS_DRIVER() to be blank (in which case the first entry from
the ID table will be used for .driver.name).
Also eliminate the questionable xenbus_register_{back,front}end()
wrappers - their sole remaining purpose was the checking of the
'owner' field, proper setting of which shouldn't be an issue anymore
when the macro gets used.
v2: Restore DRV_NAME for the driver name in xen-pciback.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Thu, 2 Feb 2012 19:21:59 +0000 (14:21 -0500)]
Merge branch 'stable/xen-pciback-0.6.3.bugfixes' into stable/for-linus-3.3.rebased
* stable/xen-pciback-0.6.3.bugfixes: (22 commits)
xen/pciback: Check if the device is found instead of blindly assuming so.
xen/pciback: Do not dereference psdev during printk when it is NULL.
xen/pciback: double lock typo
xen/pciback: use mutex rather than spinlock in vpci backend
xen/pciback: Use mutexes when working with Xenbus state transitions.
xen/pciback: miscellaneous adjustments
xen/pciback: use mutex rather than spinlock in passthrough backend
xen/pciback: use resource_size()
xen/pciback: remove duplicated #include
xen/pciback: Have 'passthrough' option instead of XEN_PCIDEV_BACKEND_PASS and XEN_PCIDEV_BACKEND_VPCI
xen/pciback: Remove the DEBUG option.
xen/pciback: Drop two backends, squash and cleanup some code.
xen/pciback: Print out the MSI/MSI-X (PIRQ) values
xen/pciback: Don't setup an fake IRQ handler for SR-IOV devices.
xen: rename pciback module to xen-pciback.
xen/pciback: Fine-grain the spinlocks and fix BUG: scheduling while atomic cases.
xen/pciback: Allocate IRQ handler for device that is shared with guest.
xen/pciback: Disable MSI/MSI-X when reseting a device
xen/pciback: guest SR-IOV support for PV guest
xen/pciback: Register the owner (domain) of the PCI device.
...
Konrad Rzeszutek Wilk [Thu, 2 Feb 2012 19:20:07 +0000 (14:20 -0500)]
Merge branch 'stable/xen-block.rebase' into stable/for-linus-3.3.rebased
* stable/xen-block.rebase: (21 commits)
xen-blkback: Don't disconnect backend until state switched to XenbusStateClosed.
block: xen-blkback: use API provided by xenbus module to map rings
xen-blkback: convert hole punching to discard request on loop devices
xen/blkback: Move processing of BLKIF_OP_DISCARD from dispatch_rw_block_io
xen/blk[front|back]: Enhance discard support with secure erasing support.
xen/blk[front|back]: Squash blkif_request_rw and blkif_request_discard together
xen/blkback: Fix two races in the handling of barrier requests.
xen/blkback: Check for proper operation.
xen/blkback: Fix the inhibition to map pages when discarding sector ranges.
xen/blkback: Report VBD_WSECT (wr_sect) properly.
xen/blkback: Support 'feature-barrier' aka old-style BARRIER requests.
xen-blkfront: plug device number leak in xlblk_init() error path
xen-blkfront: If no barrier or flush is supported, use invalid operation.
xen-blkback: use kzalloc() in favor of kmalloc()+memset()
xen-blkback: fixed indentation and comments
xen-blkfront: fix a deadlock while handling discard response
xen-blkfront: Handle discard requests.
xen-blkback: Implement discard requests ('feature-discard')
xen-blkfront: add BLKIF_OP_DISCARD and discard request struct
xen/blkback: Add module alias for autoloading
...
Julia Lawall [Fri, 23 Dec 2011 17:39:29 +0000 (18:39 +0100)]
xen-gntalloc: introduce missing kfree
Error handling code following a kmalloc should free the allocated data.
The out_unlock label is used on both success and failure, so free vm_priv
before jumping to that label.
A simplified version of the semantic match that finds the problem is as
follows: (http://coccinelle.lip6.fr)
// <smpl>
@r exists@
local idexpression x;
statement S;
identifier f1;
position p1,p2;
expression *ptr != NULL;
@@
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
<... when != x
when != if (...) { <+...x...+> }
x->f1
...>
(
return \(0\|<+...x...+>\|ptr\);
|
return@p2 ...;
)
Daniel De Graaf [Wed, 14 Dec 2011 20:12:13 +0000 (15:12 -0500)]
xen/netback: Enable netback on HVM guests
Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Wed, 14 Dec 2011 20:12:11 +0000 (15:12 -0500)]
xen/grant-table: Support mappings required by blkback
Add support for mappings without GNTMAP_contains_pte. This was not
supported because the unmap operation assumed that this flag was being
used; adding a parameter to the unmap operation to allow the PTE
clearing to be disabled is sufficient to make unmap capable of
supporting either mapping type.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
[v1: Fix cleanpatch warnings] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Wed, 14 Dec 2011 20:12:10 +0000 (15:12 -0500)]
xenbus: Use grant-table wrapper functions
For xenbus_{map,unmap}_ring to work on HVM, the grant table operations
must be set up using the gnttab_set_{map,unmap}_op functions instead of
directly populating the fields of gnttab_map_grant_ref. These functions
simply populate the structure on paravirtualized Xen; however, on HVM
they must call __pa() on vaddr when populating op->host_addr because the
hypervisor cannot directly interpret guest-virtual addresses.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
[v1: Fixed cleanpatch error] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Mon, 19 Dec 2011 19:55:14 +0000 (14:55 -0500)]
xenbus: Support HVM backends
Add HVM implementations of xenbus_(map,unmap)_ring_v(alloc,free) so
that ring mappings can be done without using GNTMAP_contains_pte which
is not supported on HVM. This also removes the need to use vmlist_lock
on PV by tracking the allocated xenbus rings.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
[v1: Fix compile error when XENBUS_FRONTEND is defined as module] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Thu, 2 Feb 2012 19:16:32 +0000 (14:16 -0500)]
Merge branch 'stable/drivers-3.2.rebased' into stable/for-linus-3.3.rebased
* stable/drivers-3.2.rebased:
xen: use static initializers in xen-balloon.c
xenbus: don't rely on xen_initial_domain to detect local xenstore
xenbus: Fix loopback event channel assuming domain 0
xen/pv-on-hvm:kexec: Fix implicit declaration of function 'xen_hvm_domain'
xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel
xen/pv-on-hvm kexec: update xs_wire.h:xsd_sockmsg_type from xen-unstable
xen/pv-on-hvm kexec+kdump: reset PV devices in kexec or crash kernel
xen/pv-on-hvm kexec: rebind virqs to existing eventchannel ports
xen/pv-on-hvm kexec: prevent crash in xenwatch_thread() when stale watch events arrive
Konrad Rzeszutek Wilk [Thu, 2 Feb 2012 19:12:34 +0000 (14:12 -0500)]
Merge branch 'stable/vmalloc-3.2.rebased' into stable/for-linus-3.3.rebased
* stable/vmalloc-3.2.rebased:
xen: map foreign pages for shared rings by updating the PTEs directly
net: xen-netback: use API provided by xenbus module to map rings
block: xen-blkback: use API provided by xenbus module to map rings
xen: use generic functions instead of xen_{alloc, free}_vm_area()
Konrad Rzeszutek Wilk [Thu, 2 Feb 2012 19:11:52 +0000 (14:11 -0500)]
Merge branch 'stable/bug.fixes-3.2.rebased' into stable/for-linus-3.3.rebased
* stable/bug.fixes-3.2.rebased:
xen: Remove hanging references to CONFIG_XEN_PLATFORM_PCI
xen/irq: If we fail during msi_capability_init return proper error code.
xen: remove XEN_PLATFORM_PCI config option
xen: XEN_PVHVM depends on PCI
xen/p2m/debugfs: Make type_name more obvious.
xen/p2m/debugfs: Fix potential pointer exception.
xen/enlighten: Fix compile warnings and set cx to known value.
xen/xenbus: Remove the unnecessary check.
xen/events: Don't check the info for NULL as it is already done.
xen/pci: Use 'acpi_gsi_to_irq' value unconditionally.
xen/pci: Remove 'xen_allocate_pirq_gsi'.
xen/pci: Retire unnecessary #ifdef CONFIG_ACPI
xen/pci: Move the allocation of IRQs when there are no IOAPIC's to the end
xen/pci: Squash pci_xen_initial_domain and xen_setup_pirqs together.
xen/pci: Use the xen_register_pirq for HVM and initial domain users
xen/pci: In xen_register_pirq bind the GSI to the IRQ after the hypercall.
xen/pci: Provide #ifdef CONFIG_ACPI to easy code squashing.
xen/pci: Update comments and fix empty spaces.
xen/pci: Shuffle code around.
Konrad Rzeszutek Wilk [Mon, 19 Dec 2011 20:08:15 +0000 (15:08 -0500)]
xen/xenbus-frontend: Fix compile error with randconfig
drivers/xen/xenbus/xenbus_dev_frontend.c: In function 'xenbus_init':
drivers/xen/xenbus/xenbus_dev_frontend.c:609:2: error: implicit declaration of function 'xen_domain'
Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Bastian Blank [Sat, 10 Dec 2011 18:29:48 +0000 (19:29 +0100)]
xen: Add xenbus_backend device
Access for xenstored to the event channel and pre-allocated ring is
managed via xenfs. This adds its own character device featuring mmap
for the ring and an ioctl for the event channel.
Signed-off-by: Bastian Blank <waldi@debian.org> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Bastian Blank [Fri, 16 Dec 2011 16:34:33 +0000 (11:34 -0500)]
xen: Add privcmd device driver
Access to arbitrary hypercalls is currently provided via xenfs. This
adds a standard character device to handle this. The support in xenfs
remains for backward compatibility and uses the device driver code.
Signed-off-by: Bastian Blank <waldi@debian.org> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Mon, 28 Nov 2011 16:49:11 +0000 (11:49 -0500)]
xen/gntalloc: fix reference counts on multi-page mappings
When a multi-page mapping of gntalloc is created, the reference counts
of all pages in the vma are incremented. However, the vma open/close
operations only adjusted the reference count of the first page in the
mapping, leaking the other pages. Store a struct in the vm_private_data
to track the original page count to properly free the pages when the
last reference to the vma is closed.
Reported-by: Anil Madhavapeddy <anil@recoil.org> Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Mon, 28 Nov 2011 16:49:10 +0000 (11:49 -0500)]
xen/gntalloc: release grant references on page free
gnttab_end_foreign_access_ref does not return the grant reference it is
passed to the free list; gnttab_free_grant_reference needs to be
explicitly called. While gnttab_end_foreign_access provides a wrapper
for this, it is unsuitable because it does not return errors.
Reported-by: Anil Madhavapeddy <anil@recoil.org> Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Mon, 28 Nov 2011 16:49:09 +0000 (11:49 -0500)]
xen/events: prevent calling evtchn_get on invalid channels
The event channel number provided to evtchn_get can be provided by
userspace, so needs to be checked against the maximum number of event
channels prior to using it to index into evtchn_to_irq.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
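The check described above amounts to validating an untrusted index before using it. A minimal sketch, assuming an illustrative table size and a hypothetical `lookup_irq()` helper (neither is taken from the kernel source):

```c
#include <errno.h>

#define NR_EVENT_CHANNELS 1024u  /* illustrative limit, assumed for this sketch */

/* Simplified stand-in for the kernel's evtchn_to_irq table. */
static int evtchn_to_irq[NR_EVENT_CHANNELS];

/*
 * Hypothetical lookup: the channel number may come from user space, so
 * it is range-checked before it indexes the table.
 */
static int lookup_irq(unsigned int evtchn, int *irq)
{
    if (evtchn >= NR_EVENT_CHANNELS)
        return -EINVAL;
    *irq = evtchn_to_irq[evtchn];
    return 0;
}
```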
Annie Li [Mon, 12 Dec 2011 10:15:07 +0000 (18:15 +0800)]
xen/granttable: Support transitive grants
These allow a domain A which has been granted access on a page of domain B's
memory to issue domain C with a copy-grant on the same page. This is useful
e.g. for forwarding packets between domains.
Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Mon, 12 Dec 2011 10:14:42 +0000 (18:14 +0800)]
xen/granttable: Support sub-page grants
- They can't be used to map the page (so can only be used in a GNTTABOP_copy
hypercall).
- It's possible to grant access with a finer granularity than whole pages.
- Xen guarantees that they can be revoked quickly (a normal map grant can
only be revoked with the cooperation of the domain which has been granted
access).
Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Luck, Tony [Wed, 30 Nov 2011 18:22:37 +0000 (10:22 -0800)]
xen/ia64: fix build breakage because of conflicting u64 guest handles
include/xen/interface/xen.h:526: error: conflicting types for ‘__guest_handle_u64’
arch/ia64/include/asm/xen/interface.h:74: error: previous declaration of ‘__guest_handle_u64’ was here
Problem introduced by "xen/granttable: Introducing grant table V2 stucture"
which added a new definition to include/xen/interface/xen.h for "u64".
Fix: delete the ia64 arch specific definition.
Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Tue, 22 Nov 2011 01:59:56 +0000 (09:59 +0800)]
xen/granttable: Keep code format clean
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Tue, 22 Nov 2011 01:59:21 +0000 (09:59 +0800)]
xen/granttable: Grant tables V2 implementation
Receiver-side copying of packets is based on this implementation; it gives
better performance and better CPU accounting. In total, V2 supports three
grant types: full-page, sub-page and transitive grants.
However, this patch does not cover sub-page and transitive grants. It mainly
focuses on the full-page part and implements the grant table V2 interfaces
corresponding to what already exists in grant table V1, such as grant table V2
initialization, mapping, releasing and exported interfaces.
Each guest can only support one grant table version; every entry in the grant
table should be the same version. The version (V1 or V2) must be set before
initializing the grant table.
The exported grant table interfaces of V2 are the same as those of V1; Xen is
responsible for determining which grant table version a guest is using in
every grant operation.
V2 fulfills the same role as V1 and is fully backwards compatible with V1.
If dom0 supports grant table V2, the guests running on it can run with either V1
or V2.
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Annie Li <annie.li@oracle.com>
[v1: Modified alloc_vm_area call (new parameters), indentation, and cleanpatch
warnings] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Tue, 22 Nov 2011 01:58:47 +0000 (09:58 +0800)]
xen/granttable: Refactor some code
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Annie Li [Tue, 22 Nov 2011 01:58:06 +0000 (09:58 +0800)]
xen/granttable: Introducing grant table V2 stucture
This patch introduces the new structures of grant table V2; grant table V2 is
an extension of V1. The grant table is shared between the guest and Xen, and
Xen is responsible for doing the corresponding work for grant operations, such
as figuring out the guest's grant table version and performing different
actions based on that version. Although the full-page structure of V2
is different from V1, it plays the same role as in V1.
Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Annie Li <annie.li@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Jeremy Fitzhardinge [Fri, 18 Nov 2011 23:56:06 +0000 (15:56 -0800)]
Xen: update MAINTAINER info
No longer at Citrix, still interested in Xen.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Thu, 27 Oct 2011 21:58:47 +0000 (17:58 -0400)]
xen/event: Add reference counting to event channels
Event channels exposed to userspace by the evtchn module may be used by
other modules in an asynchronous manner, which requires that reference
counting be used to prevent the event channel from being closed before
the signals are delivered.
The reference count on new event channels defaults to -1 which indicates
the event channel is not referenced outside the kernel; evtchn_get fails
if called on such an event channel. The event channels made visible to
userspace by evtchn have a normal reference count.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
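The reference-count convention described above can be sketched as follows. The structure, the function names and the error code are assumptions made for illustration; only the rule itself (-1 means "not referenced outside the kernel", and a get on such a channel fails) comes from the commit message.

```c
#include <errno.h>

/* Simplified per-channel state; the field name is assumed for this sketch. */
struct chan_info {
    int refcnt;   /* -1: kernel-internal, not referenced from outside */
};

static void chan_init(struct chan_info *info)
{
    info->refcnt = -1;
}

/* Called when the channel is made visible to user space. */
static void chan_make_refcounted(struct chan_info *info)
{
    info->refcnt = 1;
}

/* Take a reference; fails on channels that are not reference counted. */
static int chan_get(struct chan_info *info)
{
    if (info->refcnt <= 0)
        return -ENOENT;   /* error code chosen for the sketch */
    info->refcnt++;
    return 0;
}

/* Returns 1 when the last reference is dropped and the channel may be closed. */
static int chan_put(struct chan_info *info)
{
    if (info->refcnt <= 0)
        return 0;
    return --info->refcnt == 0;
}
```

A real implementation would need to protect the counter against concurrent access; the sketch is single-threaded.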
Konrad Rzeszutek Wilk [Thu, 2 Feb 2012 18:58:45 +0000 (13:58 -0500)]
Merge branch 'in-3.1/bug.fixes' into stable/for-linus-3.3.rebased
* in-3.1/bug.fixes:
x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode.
xen/i386: follow-up to "replace order-based range checking of M2P table by linear one"
xen/irq: Alter the locking to use a mutex instead of a spinlock.
xen/e820: if there is no dom0_mem=, don't tweak extra_pages.
Revert "xen/e820: if there is no dom0_mem=, don't tweak extra_pages."
xen/e820: if there is no dom0_mem=, don't tweak extra_pages.
xen: disable PV spinlocks on HVM
xen/smp: Warn user why they keel over - nosmp or noapic and what to use instead.
xen: x86_32: do not enable iterrupts when returning from exception in interrupt context
xen: use maximum reservation to limit amount of usable RAM
xen: Do not enable PV IPIs when vector callback not present
xen/x86: replace order-based range checking of M2P table by linear one
xen: Fix misleading WARN message at xen_release_chunk
xen: Fix printk() format in xen/setup.c
xen/grant: Fix compile warning.
xen:pvhvm: Modpost section mismatch fix
Daniel De Graaf [Thu, 27 Oct 2011 21:58:49 +0000 (17:58 -0400)]
xen/gnt{dev,alloc}: reserve event channels for notify
When using the unmap notify ioctl, the event channel used for
notification needs to be reserved to avoid it being deallocated prior to
sending the notification.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Daniel De Graaf [Thu, 27 Oct 2011 21:58:48 +0000 (17:58 -0400)]
xen/gntalloc: Change gref_lock to a mutex
The event channel release function cannot be called under a spinlock
because it can attempt to acquire a mutex due to the event channel
reference acquired when setting up unmap notifications.
Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
David Vrabel [Wed, 26 Oct 2011 10:57:43 +0000 (11:57 +0100)]
xen: document balloon driver sysfs files
Add ABI documentation for the balloon driver's sysfs files.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <dkiper@net-space.pl>
[v2: Added comments from Daniel] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Dan Carpenter [Thu, 26 Jan 2012 13:55:16 +0000 (16:55 +0300)]
xfs: fix acl count validation in xfs_acl_from_disk()
We applied a fix for CVE-2012-0038, fa8b18edd7 "xfs: validate acl count",
but there was a follow-on patch which is not in our kernel. If count
was negative, it could get past the new check.
From 093019cf1b18dd31b2c3b77acce4e000e2cbc9ce Mon Sep 17 00:00:00 2001
From: Xi Wang <xi.wang@gmail.com>
Date: Mon, 12 Dec 2011 21:55:52 +0000
Subject: [PATCH] xfs: fix acl count validation in xfs_acl_from_disk()
Commit fa8b18ed didn't prevent the integer overflow and possible
memory corruption. "count" can go negative and bypass the check.
Signed-off-by: Xi Wang <xi.wang@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
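How a negative count slips past an upper-bound check can be shown with two comparisons. This is a sketch under assumptions: the limit value and both function names are illustrative, and comparing as unsigned is one common remedy rather than a verbatim copy of the patch.

```c
#include <stdint.h>

#define ACL_MAX_ENTRIES 25  /* illustrative limit, assumed for this sketch */

/* Signed comparison: a negative count is below the limit and passes. */
static int count_ok_signed(int32_t count)
{
    return count <= ACL_MAX_ENTRIES;
}

/* Unsigned comparison: a negative value becomes huge and is rejected. */
static int count_ok_unsigned(int32_t count)
{
    return (uint32_t)count <= ACL_MAX_ENTRIES;
}
```

A count of -1 passes the signed check but, reinterpreted as an unsigned 32-bit value, is 4294967295 and fails the unsigned one.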
Srinivas Eeda [Tue, 31 Jan 2012 22:37:19 +0000 (14:37 -0800)]
ocfs2: use spinlock irqsave for downconvert lock.patch
When the ocfs2dc thread holds the dc_task_lock spinlock and receives a soft IRQ,
it deadlocks trying to take the same spinlock in ocfs2_wake_downconvert_thread.
Below is the stack snippet.
The patch disables interrupts when acquiring dc_task_lock spinlock.
Chris Mason [Wed, 25 Jan 2012 18:47:40 +0000 (13:47 -0500)]
Btrfs: fix reservations in btrfs_page_mkwrite
Josef fixed btrfs_page_mkwrite to properly release reserved
extents if there was an error. But if we fail to get a reservation
and we fail to dirty the inode (for ENOSPC reasons), we'll end up
trying to release a reservation we never had.
This makes sure we only release if we were able to reserve.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
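The rule "only release if we were able to reserve" can be sketched with a toy space pool. All names here are hypothetical and the pool merely stands in for the filesystem's reservation accounting; the point is the `reserved` flag on the shared error path.

```c
#include <errno.h>

/* Toy space pool standing in for the filesystem's reservation accounting. */
struct pool {
    long free_bytes;
};

static int pool_reserve(struct pool *p, long bytes)
{
    if (bytes > p->free_bytes)
        return -ENOSPC;
    p->free_bytes -= bytes;
    return 0;
}

static void pool_release(struct pool *p, long bytes)
{
    p->free_bytes += bytes;
}

/*
 * Reserve, then run a step that may fail. The reservation is released on
 * the error path only if it was actually obtained.
 */
static int do_write(struct pool *p, long bytes, int step_fails)
{
    int reserved = 0;
    int ret;

    ret = pool_reserve(p, bytes);
    if (ret)
        goto out;
    reserved = 1;

    if (step_fails) {
        ret = -EIO;
        goto out;
    }
    return 0;   /* success: the reservation stays in use */
out:
    if (reserved)
        pool_release(p, bytes);
    return ret;
}
```

Without the flag, a failed reservation followed by the common `out` path would hand back space that was never taken, inflating the pool.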
Chris Mason [Mon, 16 Jan 2012 13:13:11 +0000 (08:13 -0500)]
Btrfs: use larger system chunks
System chunks by default are very small. This makes them slightly
larger and also fixes the conditional checks to make sure we don't
allocate a billion of them at once.
Josef Bacik [Fri, 13 Jan 2012 17:09:22 +0000 (12:09 -0500)]
Btrfs: add a delalloc mutex to inodes for delalloc reservations
I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
that and there's no real way to get rid of those, so just stop using i_mutex to
protect delalloc metadata reservations and use a delalloc mutex instead. This
shouldn't be contended often at all, only if you are writing and mmap writing to
the file at the same time. Thanks,
Josef Bacik [Fri, 2 Dec 2011 20:44:12 +0000 (15:44 -0500)]
Btrfs: protect orphan block rsv with spin_lock
We've been seeing warnings coming out of the orphan commit stuff forever from
ceph. Turns out it's because we're racing with checking if the orphan block
reserve is set, because we clear it outside of the spin_lock. So leave the
normal fastpath checks where they are, but take the spin_lock and _recheck_ to
make sure we haven't had an orphan block rsv added in the meantime. Then clear
the root's orphan block rsv and release the lock. With this patch a user said
the warnings went away and they usually showed up pretty soon after he started
ceph. Thanks,
Josef Bacik [Fri, 13 Jan 2012 00:10:12 +0000 (19:10 -0500)]
Btrfs: don't call btrfs_throttle in file write
Btrfs_throttle will make us wait if there is a currently committing transaction
until we can open new transactions, which is ridiculous since we don't actually
start any transactions within the file write path anyway, so all this does is
introduce big latencies if we have a sync/fsync heavy workload going on while
somebody else is trying to do work. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 45a8090e626ab470c91142954431a93846030b0d)
Josef Bacik [Fri, 13 Jan 2012 00:10:12 +0000 (19:10 -0500)]
Btrfs: release space on error in page_mkwrite
If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
which is a problem since we make our reservation right before trying to update
the inode, so fix the out label so that we actually free our reservation.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit ec39e180fd3188c983c94603634bfcd019f42ae7)
This is because of the wrong if condition, which is used to check if we should
subtract the bytes of the dropped range from i_blocks/i_bytes of i-node or not.
When we truncate a compressed extent, btrfs subtracts the bytes of the whole
extent, which is wrong. We should subtract the real size that we truncate,
whether or not it is a compressed extent. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit f70a9a6b94af86fca069a7552ab672c31b457786)
Josef Bacik [Fri, 13 Jan 2012 00:10:12 +0000 (19:10 -0500)]
Btrfs: do not use btrfs_end_transaction_throttle everywhere
A user reported a problem where things like open with O_CREAT would take up to
30 seconds when he had nfs activity on the same mount. This is because all of
our quick metadata operations, like create, symlink etc all do
btrfs_end_transaction_throttle, which if the transaction is blocked will wait
for the commit to complete before it returns. This adds a ridiculous amount of
latency and isn't really needed. The normal btrfs_end_transaction will mark the
transaction as blocked and wake the transaction kthread up if it thinks the
transaction needs to end (this being in the running out of global reserve space
scenario), and this is all that is really needed since we've already done
everything we're going to do, we just need to return. This should help people
with the latency they were seeing when using synchronous heavy workloads.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 7ad85bb76a61801362701b77c5cee5aa09f35369)
Li Zefan [Wed, 7 Dec 2011 03:38:24 +0000 (11:38 +0800)]
Btrfs: fix possible deadlock when opening a seed device
The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex,
but when we mount a filesystem which has backing seed devices, we have
this lock chain:
Since a seed device is read-only, there's no usable space in the filesystem.
Afterwards we add a sprout device to it, and the kernel creates a METADATA
block group and a SYSTEM block group, which provide free space we can reserve,
but we still get a reservation failure because the global block_rsv hasn't
been updated accordingly.
Li Zefan [Thu, 29 Dec 2011 06:47:27 +0000 (14:47 +0800)]
Btrfs: rewrite btrfs_trim_block_group()
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (possibly
none at all). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
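Several of the bugs listed above come down to not intersecting a free extent with the user-specified range and minimum length. A minimal sketch of that intersection, assuming a hypothetical helper `clamp_trim()` and half-open byte ranges (this is not the rewritten btrfs code):

```c
#include <stdint.h>

/*
 * Hypothetical helper: intersect a free extent
 * [ext_start, ext_start + ext_len) with the requested range [start, end)
 * and return the number of bytes to trim, or 0 if the overlap is empty or
 * shorter than minlen. *out_start receives the start of the trimmed
 * region. Assumes ext_start + ext_len does not overflow.
 */
static uint64_t clamp_trim(uint64_t ext_start, uint64_t ext_len,
                           uint64_t start, uint64_t end, uint64_t minlen,
                           uint64_t *out_start)
{
    uint64_t ext_end = ext_start + ext_len;
    uint64_t lo = ext_start > start ? ext_start : start;
    uint64_t hi = ext_end < end ? ext_end : end;

    if (lo >= hi || hi - lo < minlen)
        return 0;
    *out_start = lo;
    return hi - lo;
}
```

Clamping both ends keeps the trim from starting below the requested offset or running past the requested range; in a real allocator, an extent skipped for being shorter than minlen must still be returned to the free-space cache so it is not leaked.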
Li Zefan [Thu, 1 Dec 2011 06:06:42 +0000 (14:06 +0800)]
Btrfs: simplfy calculation of stripe length for discard operation
For btrfs raid, while discarding a range of space, we'll need to know
the start offset and length to discard for each device, and it's done
in btrfs_map_block().
However, the calculation is a bit complex for raid0 and raid10, so I
reimplemented it based on the fact that:
Li Zefan [Thu, 1 Dec 2011 04:55:47 +0000 (12:55 +0800)]
Btrfs: don't pre-allocate btrfs bio
We pre-allocate a btrfs bio with fixed size, and then may re-allocate
memory if we find stripes are bigger than the fixed size. But this
pre-allocation is not necessary.
Also, we don't have to calculate the stripe number twice.
Alexandre Oliva [Fri, 14 Oct 2011 15:10:36 +0000 (12:10 -0300)]
Btrfs: revamp clustered allocation logic
Parameterize clusters on minimum total size, minimum chunk size and
minimum contiguous size for at least one chunk, without limits on
cluster, window or gap sizes. Don't tolerate any fragmentation for
SSD_SPREAD; accept it for metadata, but try to keep data dense.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 1bb91902dc90e25449893e693ad45605cb08fbe5)
Alexandre Oliva [Mon, 28 Nov 2011 14:36:17 +0000 (12:36 -0200)]
Btrfs: don't set up allocation result twice
We store the allocation start and length twice in ins, once right
after the other, but with intervening calls that may prevent the
duplicate from being optimized out by the compiler. Remove one of the
assignments.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit fc7c1077ceb99c35e5f9d0ce03dc7740565bb2bf)
Alexandre Oliva [Mon, 12 Dec 2011 06:48:19 +0000 (04:48 -0200)]
Btrfs: test free space only for unclustered allocation
Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained. Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.
I've moved the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit a5f6f719a5cd7caeee8ed8137cf3f94c3bbebc65)
Chris Mason [Fri, 6 Jan 2012 20:41:34 +0000 (15:41 -0500)]
Btrfs: lower the bar for chunk allocation
The chunk allocation code has tried to keep a pretty tight lid on creating new
metadata chunks. This is partially because in the past the reservation
code didn't give us an accurate idea of how much space was being used.
The new code is much more accurate, so we're able to get rid of some of these
checks.
Chris Mason [Fri, 6 Jan 2012 20:23:57 +0000 (15:23 -0500)]
Btrfs: run chunk allocations while we do delayed refs
Btrfs tries to batch extent allocation tree changes to improve performance
and reduce metadata thrashing. But it doesn't allocate new metadata chunks
while it is doing allocations for the extent allocation tree.
This commit changes the delayed reference code to do chunk allocations if we're
getting low on room. It prevents crashes and improves performance.
Al Viro [Fri, 23 Dec 2011 12:58:13 +0000 (07:58 -0500)]
Btrfs: call d_instantiate after all ops are setup
This closes races where btrfs is calling d_instantiate too soon during
inode creation. All of the callers of btrfs_add_nondir are updated to
instantiate after the inode is fully set up in memory.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
(cherry picked from commit 08c422c27f855d27b0b3d9fa30ebd938d4ae6f1f)
Chris Mason [Fri, 23 Dec 2011 12:53:00 +0000 (07:53 -0500)]
Btrfs: fix worker lock misuse in find_worker
Dan Carpenter noticed that we were doing a double unlock on the worker
lock, and sometimes picking a worker thread without the lock held.
This fixes both errors.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
(cherry picked from commit 8d532b2afb2eacc84588db709ec280a3d1219be3)
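The two errors named in the find_worker fix (a double unlock, and picking a worker without the lock held) can be illustrated with a small userspace model. This is a sketch under assumptions: `WorkerPool`, its `find_worker` method, and `double_unlock_is_an_error` are hypothetical names, and Python's `threading.Lock` stands in for the kernel spinlock.

```python
import threading


class WorkerPool:
    """Hypothetical pool whose idle list is only touched under the lock."""

    def __init__(self, workers):
        self._lock = threading.Lock()
        self._idle = list(workers)

    def find_worker(self):
        # The 'with' block acquires the lock once and releases it exactly
        # once on every exit path, so the pick always happens under the lock
        # and a double unlock cannot occur.
        with self._lock:
            if not self._idle:
                return None
            return self._idle.pop(0)


def double_unlock_is_an_error():
    """Show that releasing an already released lock is rejected."""
    lock = threading.Lock()
    lock.acquire()
    lock.release()
    try:
        lock.release()  # second release of the same lock
    except RuntimeError:
        return True
    return False


if __name__ == "__main__":
    pool = WorkerPool(["w0", "w1"])
    print(pool.find_worker(), pool.find_worker(), pool.find_worker())
    print(double_unlock_is_an_error())
```

Python reports the second release as a RuntimeError; a kernel spinlock gives no such guarantee, which is why this class of bug has to be caught by review or static analysis.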
Konrad Rzeszutek Wilk [Tue, 24 Jan 2012 21:55:29 +0000 (16:55 -0500)]
xen/config: turn CONFIG_XEN_DEBUG_FS off.
That option makes the Xen spinlock code (xen/spinlock.c) accumulate
statistics about how many locks were taken, time spent in the slowpath, etc.
This is useful information during debugging but not in production.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Maxim Uvarov [Mon, 23 Jan 2012 20:08:00 +0000 (12:08 -0800)]
proc: clean up and fix /proc/<pid>/mem handling
Orabug: 13618927
CVE-2012-0056
Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
robust, and it also doesn't match the permission checking of any of the
other related files.
This changes it to do the permission checks at open time, and instead of
tracking the process, it tracks the VM at the time of the open. That
simplifies the code a lot, but does mean that if you hold the file
descriptor open over an execve(), you'll continue to read from the _old_
VM.
That is different from our previous behavior, but much simpler. If
somebody actually finds a load where this matters, we'll need to revert
this commit.
I suspect that nobody will ever notice - because the process mapping
addresses will also have changed as part of the execve. So you cannot
actually usefully access the fd across a VM change simply because all
the offsets for IO would have changed too.
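The remark about IO offsets follows from how the file is addressed: the file offset is a virtual address in the target's address space. A minimal sketch of that, reading the calling process's own memory through /proc/self/mem, is shown here. The function name `read_own_memory` is hypothetical, and the code returns None on systems where the file is missing or access is refused.

```python
import ctypes


def read_own_memory(payload=b"hello from my own address space"):
    """Read a buffer back through /proc/self/mem, or return None if unavailable."""
    buf = ctypes.create_string_buffer(payload)  # keep a reference so it stays alive
    addr = ctypes.addressof(buf)                # virtual address of the buffer
    try:
        # Access is checked when the file is opened; after that, the file
        # offset is simply a virtual address in that address space.
        with open("/proc/self/mem", "rb", buffering=0) as mem:
            mem.seek(addr)
            return mem.read(len(payload))
    except OSError:
        return None


if __name__ == "__main__":
    print(read_own_memory())
```

Because the offset is an address, a descriptor held across an execve() would be seeking to addresses that mean nothing in the new image, which is the point the message makes.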
Reported-by: Jüri Aedla <asd@ut.ee>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
Maxim Uvarov [Sat, 21 Jan 2012 01:45:24 +0000 (17:45 -0800)]
add __init arguments to init functions
Fix the following issues:
WARNING: vmlinux.o(.text+0x3aba): Section mismatch in reference from the function xen_align_and_add_e820_region() to the function .init.text:e820_add_region()
The function xen_align_and_add_e820_region() references
the function __init e820_add_region().
This is often because xen_align_and_add_e820_region lacks a __init
annotation or the annotation of e820_add_region is wrong.
WARNING: vmlinux.o(.text+0x2e9ec): Section mismatch in reference from the function acpi_map_cpu2node() to the variable .cpuinit.data:__apicid_to_node
The function acpi_map_cpu2node() references
the variable __cpuinitdata __apicid_to_node.
This is often because acpi_map_cpu2node lacks a __cpuinitdata
annotation or the annotation of __apicid_to_node is wrong.
WARNING: vmlinux.o(.text+0x2e9f1): Section mismatch in reference from the function acpi_map_cpu2node() to the function .cpuinit.text:numa_set_node()
The function acpi_map_cpu2node() references
the function __cpuinit numa_set_node().
This is often because acpi_map_cpu2node lacks a __cpuinit
annotation or the annotation of numa_set_node is wrong.
WARNING: vmlinux.o(.text+0x3f9b4): Section mismatch in reference from the function enable_iommus() to the function .init.text:iommu_set_device_table()
The function enable_iommus() references
the function __init iommu_set_device_table().
This is often because enable_iommus lacks a __init
annotation or the annotation of iommu_set_device_table is wrong.
Maxim Uvarov [Sun, 15 Jan 2012 20:08:20 +0000 (12:08 -0800)]
hpwdt: clean up set_memory_x call for 32 bit
1. The address has to be page aligned.
2. set_memory_x takes a number of pages as its argument, not a size in bytes.
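The arithmetic behind these two points can be sketched as follows. This is a userspace illustration rather than the driver code: `page_span` is a hypothetical helper, and the 4 KiB page size is an assumption matching 32-bit x86.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on 32-bit x86


def page_span(addr, size, page_size=PAGE_SIZE):
    """Return (aligned_addr, numpages) covering the byte range [addr, addr + size)."""
    aligned = addr & ~(page_size - 1)  # round the start down to a page boundary
    end = addr + size
    # Round the covered length up to whole pages, counting from the aligned start.
    numpages = (end - aligned + page_size - 1) // page_size
    return aligned, numpages


if __name__ == "__main__":
    aligned, numpages = page_span(0x1234, 0x2000)
    print(hex(aligned), numpages)
```

Note that an unaligned start can add a page: an 8 KiB region that begins in the middle of a page spans three pages, not two, so dividing the size alone would undercount.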
The bug was introduced by the following commit:
commit da28179b4e90dda56912ee825c7eaa62fc103797
Author: Mingarelli, Thomas <Thomas.Mingarelli@hp.com>
Date: Mon Nov 7 10:59:00 2011 +0100
watchdog: hpwdt: Changes to handle NX secure bit in 32bit path