]> www.infradead.org Git - users/willy/linux.git/log
users/willy/linux.git
14 years agoKVM: SVM: Add function to recalculate intercept masks
Joerg Roedel [Tue, 30 Nov 2010 17:03:56 +0000 (18:03 +0100)]
KVM: SVM: Add function to recalculate intercept masks

This patch adds a function to recalculate the effective
intercepts masks when the vcpu is in guest-mode and either
the host or the guest intercept masks change.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: X86: Don't report L2 emulation failures to user-space
Joerg Roedel [Mon, 29 Nov 2010 16:51:49 +0000 (17:51 +0100)]
KVM: X86: Don't report L2 emulation failures to user-space

This patch prevents that emulation failures which result
from emulating an instruction for an L2-Guest results in
being reported to userspace.
Without this patch a malicious L2-Guest would be able to
kill the L1 by triggering a race-condition between an vmexit
and the instruction emulator.
With this patch the L2 will most likely only kill itself in
this situation.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: SVM: Make Use of the generic guest-mode functions
Joerg Roedel [Mon, 29 Nov 2010 16:51:48 +0000 (17:51 +0100)]
KVM: SVM: Make Use of the generic guest-mode functions

This patch replaces the is_nested logic in the SVM module
with the generic notion of guest-mode.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: X86: Introduce generic guest-mode representation
Joerg Roedel [Mon, 29 Nov 2010 16:51:47 +0000 (17:51 +0100)]
KVM: X86: Introduce generic guest-mode representation

This patch introduces a generic representation of guest-mode
fpr a vcpu. This currently only exists in the SVM code.
Having this representation generic will help making the
non-svm code aware of nesting when this is necessary.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Pull extra page fault information into struct x86_exception
Avi Kivity [Mon, 29 Nov 2010 14:12:30 +0000 (16:12 +0200)]
KVM: Pull extra page fault information into struct x86_exception

Currently page fault cr2 and nesting infomation are carried outside
the fault data structure.  Instead they are placed in the vcpu struct,
which results in confusion as global variables are manipulated instead
of passing parameters.

Fix this issue by adding address and nested fields to struct x86_exception,
so this struct can carry all information associated with a fault.

Signed-off-by: Avi Kivity <avi@redhat.com>
Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Tested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Push struct x86_exception into walk_addr()
Avi Kivity [Mon, 22 Nov 2010 15:53:27 +0000 (17:53 +0200)]
KVM: Push struct x86_exception into walk_addr()

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Push struct x86_exception info the various gva_to_gpa variants
Avi Kivity [Mon, 22 Nov 2010 15:53:26 +0000 (17:53 +0200)]
KVM: Push struct x86_exception info the various gva_to_gpa variants

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86 emulator: simplify exception generation
Avi Kivity [Mon, 22 Nov 2010 15:53:25 +0000 (17:53 +0200)]
KVM: x86 emulator: simplify exception generation

Immediately after we generate an exception, we want a X86EMUL_PROPAGATE_FAULT
constant, so return it from the generation functions.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86 emulator: tighen up ->read_std() and ->write_std() error checks
Avi Kivity [Mon, 22 Nov 2010 15:53:24 +0000 (17:53 +0200)]
KVM: x86 emulator: tighen up ->read_std() and ->write_std() error checks

Instead of checking for X86EMUL_PROPAGATE_FAULT, check for any error,
making the callers more reliable.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86 emulator: drop dead pf injection in emulate_popf()
Avi Kivity [Mon, 22 Nov 2010 15:53:23 +0000 (17:53 +0200)]
KVM: x86 emulator: drop dead pf injection in emulate_popf()

If rc == X86EMUL_PROPAGATE_FAULT, we would have returned earlier.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86 emulator: make emulator memory callbacks return full exception
Avi Kivity [Mon, 22 Nov 2010 15:53:22 +0000 (17:53 +0200)]
KVM: x86 emulator: make emulator memory callbacks return full exception

This way, they can return #GP, not just #PF.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86 emulator: introduce struct x86_exception to communicate faults
Avi Kivity [Mon, 22 Nov 2010 15:53:21 +0000 (17:53 +0200)]
KVM: x86 emulator: introduce struct x86_exception to communicate faults

Introduce a structure that can contain an exception to be passed back
to main kvm code.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: MMU: delay flush all tlbs on sync_page path
Xiao Guangrong [Tue, 23 Nov 2010 03:13:00 +0000 (11:13 +0800)]
KVM: MMU: delay flush all tlbs on sync_page path

Quote from Avi:
| I don't think we need to flush immediately; set a "tlb dirty" bit somewhere
| that is cleareded when we flush the tlb.  kvm_mmu_notifier_invalidate_page()
| can consult the bit and force a flush if set.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: MMU: abstract invalid guest pte mapping
Xiao Guangrong [Tue, 23 Nov 2010 03:08:42 +0000 (11:08 +0800)]
KVM: MMU: abstract invalid guest pte mapping

Introduce a common function to map invalid gpte

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: MMU: remove 'clear_unsync' parameter
Xiao Guangrong [Fri, 19 Nov 2010 09:04:03 +0000 (17:04 +0800)]
KVM: MMU: remove 'clear_unsync' parameter

Remove it since we can judge it by using sp->unsync

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: MMU: rename 'reset_host_protection' to 'host_writable'
Lai Jiangshan [Fri, 19 Nov 2010 09:03:22 +0000 (17:03 +0800)]
KVM: MMU: rename 'reset_host_protection' to 'host_writable'

Rename it to fit its sense better

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: MMU: don't drop spte if overwrite it from W to RO
Xiao Guangrong [Fri, 19 Nov 2010 09:02:35 +0000 (17:02 +0800)]
KVM: MMU: don't drop spte if overwrite it from W to RO

We just need flush tlb if overwrite a writable spte with a read-only one.

And we should move this operation to set_spte() for sync_page path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: MMU: fix forgot flush tlbs on sync_page path
Xiao Guangrong [Fri, 19 Nov 2010 09:01:40 +0000 (17:01 +0800)]
KVM: MMU: fix forgot flush tlbs on sync_page path

We should flush all tlbs after drop spte on sync_page path since

Quote from Avi:
| sync_page
| drop_spte
| kvm_mmu_notifier_invalidate_page
| kvm_unmap_rmapp
| spte doesn't exist -> no flush
| page is freed
| guest can write into freed page?

KVM-Stable-Tag.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: PPC: Fix compile warning
Alexander Graf [Thu, 25 Nov 2010 09:25:44 +0000 (10:25 +0100)]
KVM: PPC: Fix compile warning

KVM compilation fails with the following warning:

include/linux/kvm_host.h: In function 'kvm_irq_routing_update':
include/linux/kvm_host.h:679:2: error: 'struct kvm' has no member named 'irq_routing'

That function is only used and reasonable to have on systems that implement
an in-kernel interrupt chip. PPC doesn't.

Fix by #ifdef'ing it out when no irqchip is available.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: Add instruction-set-specific exit qualifications to kvm_exit trace
Avi Kivity [Thu, 18 Nov 2010 11:09:54 +0000 (13:09 +0200)]
KVM: Add instruction-set-specific exit qualifications to kvm_exit trace

The exit reason alone is insufficient to understand exactly why an exit
occured; add ISA-specific trace parameters for additional information.

Because fetching these parameters is expensive on vmx, and because these
parameters are fetched even if tracing is disabled, we fetch the
parameters via a callback instead of as traditional trace arguments.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: Record instruction set in kvm_exit tracepoint
Avi Kivity [Wed, 17 Nov 2010 16:44:19 +0000 (18:44 +0200)]
KVM: Record instruction set in kvm_exit tracepoint

exit_reason's meaning depend on the instruction set; record it so a trace
taken on one machine can be interpreted on another.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: fast-path msi injection with irqfd
Michael S. Tsirkin [Thu, 18 Nov 2010 17:09:08 +0000 (19:09 +0200)]
KVM: fast-path msi injection with irqfd

Store irq routing table pointer in the irqfd object,
and use that to inject MSI directly without bouncing out to
a kernel thread.

While we touch this structure, rearrange irqfd fields to make fastpath
better packed for better cache utilization.

This also adds some comments about locking rules and rcu usage in code.

Some notes on the design:
- Use pointer into the rt instead of copying an entry,
  to make it possible to use rcu, thus side-stepping
  locking complexities.  We also save some memory this way.
- Old workqueue code is still used for level irqs.
  I don't think we DTRT with level anyway, however,
  it seems easier to keep the code around as
  it has been thought through and debugged, and fix level later than
  rip out and re-instate it later.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: VMX: Fold __vmx_vcpu_run() into vmx_vcpu_run()
Avi Kivity [Thu, 18 Nov 2010 11:12:52 +0000 (13:12 +0200)]
KVM: VMX: Fold __vmx_vcpu_run() into vmx_vcpu_run()

cea15c2 ("KVM: Move KVM context switch into own function") split vmx_vcpu_run()
to prevent multiple copies of the context switch from being generated (causing
problems due to a label).  This patch folds them back together again and adds
the __noclone attribute to prevent the label from being duplicated.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: x86 emulator: do not perform address calculations on linear addresses
Avi Kivity [Wed, 17 Nov 2010 13:28:22 +0000 (15:28 +0200)]
KVM: x86 emulator: do not perform address calculations on linear addresses

Linear addresses are supposed to already have segment checks performed on them;
if we play with these addresses the checks become invalid.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86 emulator: preserve an operand's segment identity
Avi Kivity [Wed, 17 Nov 2010 13:28:21 +0000 (15:28 +0200)]
KVM: x86 emulator: preserve an operand's segment identity

Currently the x86 emulator converts the segment register associated with
an operand into a segment base which is added into the operand address.
This loss of information results in us not doing segment limit checks properly.

Replace struct operand's addr.mem field by a segmented_address structure
which holds both the effetive address and segment.  This will allow us to
do the limit check at the point of access.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86 emulator: drop DPRINTF()
Avi Kivity [Wed, 17 Nov 2010 11:40:51 +0000 (13:40 +0200)]
KVM: x86 emulator: drop DPRINTF()

Failed emulation is reported via a tracepoint; the cmps printk is pointless.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86 emulator: drop unused #ifndef __KERNEL__
Avi Kivity [Wed, 17 Nov 2010 11:40:50 +0000 (13:40 +0200)]
KVM: x86 emulator: drop unused #ifndef __KERNEL__

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: VMX: Inform user about INTEL_TXT dependency
Shane Wang [Wed, 17 Nov 2010 03:40:17 +0000 (11:40 +0800)]
KVM: VMX: Inform user about INTEL_TXT dependency

Inform user to either disable TXT in the BIOS or do TXT launch
with tboot before enabling KVM since some BIOSes do not set
FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX bit when TXT is enabled.

Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: rename hardware_[dis|en]able() to *_nolock() and add locking wrappers
Takuya Yoshikawa [Tue, 16 Nov 2010 08:37:41 +0000 (17:37 +0900)]
KVM: rename hardware_[dis|en]able() to *_nolock() and add locking wrappers

The naming convension of hardware_[dis|en]able family is little bit confusing
because only hardware_[dis|en]able_all are using _nolock suffix.

Renaming current hardware_[dis|en]able() to *_nolock() and using
hardware_[dis|en]able() as wrapper functions which take kvm_lock for them
reduces extra confusion.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: take kvm_lock for hardware_disable() during cpu hotplug
Takuya Yoshikawa [Tue, 16 Nov 2010 08:35:02 +0000 (17:35 +0900)]
KVM: take kvm_lock for hardware_disable() during cpu hotplug

In kvm_cpu_hotplug(), only CPU_STARTING case is protected by kvm_lock.
This patch adds missing protection for CPU_DYING case.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: MMU: don't mark spte notrap if reserved bit set
Xiao Guangrong [Wed, 17 Nov 2010 04:11:41 +0000 (12:11 +0800)]
KVM: MMU: don't mark spte notrap if reserved bit set

If reserved bit is set, we need inject the #PF with PFEC.RSVD=1,
but shadow_notrap_nonpresent_pte injects #PF with PFEC.RSVD=0 only

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Document device assigment API
Jan Kiszka [Tue, 16 Nov 2010 21:30:07 +0000 (22:30 +0100)]
KVM: Document device assigment API

Adds API documentation for KVM_[DE]ASSIGN_PCI_DEVICE,
KVM_[DE]ASSIGN_DEV_IRQ, KVM_SET_GSI_ROUTING, KVM_ASSIGN_SET_MSIX_NR, and
KVM_ASSIGN_SET_MSIX_ENTRY.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Clean up kvm_vm_ioctl_assigned_device
Jan Kiszka [Tue, 16 Nov 2010 21:30:06 +0000 (22:30 +0100)]
KVM: Clean up kvm_vm_ioctl_assigned_device

Any arch not supporting device assigment will also not build
assigned-dev.c. So testing for KVM_CAP_DEVICE_DEASSIGNMENT is pointless.
KVM_CAP_ASSIGN_DEV_IRQ is unconditinally set. Moreover, add a default
case for dispatching the ioctl.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Save/restore state of assigned PCI device
Jan Kiszka [Tue, 16 Nov 2010 21:30:05 +0000 (22:30 +0100)]
KVM: Save/restore state of assigned PCI device

The guest may change states that pci_reset_function does not touch. So
we better save/restore the assigned device across guest usage.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Refactor IRQ names of assigned devices
Jan Kiszka [Tue, 16 Nov 2010 21:30:04 +0000 (22:30 +0100)]
KVM: Refactor IRQ names of assigned devices

Cosmetic change, but it helps to correlate IRQs with PCI devices.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Switch assigned device IRQ forwarding to threaded handler
Jan Kiszka [Tue, 16 Nov 2010 21:30:03 +0000 (22:30 +0100)]
KVM: Switch assigned device IRQ forwarding to threaded handler

This improves the IRQ forwarding for assigned devices: By using the
kernel's threaded IRQ scheme, we can get rid of the latency-prone work
queue and simplify the code in the same run.

Moreover, we no longer have to hold assigned_dev_lock while raising the
guest IRQ, which can be a lenghty operation as we may have to iterate
over all VCPUs. The lock is now only used for synchronizing masking vs.
unmasking of INTx-type IRQs, thus is renames to intx_lock.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Clear assigned guest IRQ on release
Jan Kiszka [Tue, 16 Nov 2010 21:30:02 +0000 (22:30 +0100)]
KVM: Clear assigned guest IRQ on release

When we deassign a guest IRQ, clear the potentially asserted guest line.
There might be no chance for the guest to do this, specifically if we
switch from INTx to MSI mode.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Mask KVM_GET_SUPPORTED_CPUID data with Linux cpuid info
Avi Kivity [Tue, 9 Nov 2010 14:15:43 +0000 (16:15 +0200)]
KVM: Mask KVM_GET_SUPPORTED_CPUID data with Linux cpuid info

This allows Linux to mask cpuid bits if, for example, nx is enabled on only
some cpus.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: SVM: Replace svm_has() by standard Linux cpuid accessors
Avi Kivity [Tue, 9 Nov 2010 14:15:42 +0000 (16:15 +0200)]
KVM: SVM: Replace svm_has() by standard Linux cpuid accessors

Instead of querying cpuid directly, use the Linux accessors (boot_cpu_has,
etc.).  This allows the things like the clearcpuid kernel command line to
work (when it's fixed wrt scattered cpuid bits).

Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: MMU: fix apf prefault if nested guest is enabled
Xiao Guangrong [Fri, 12 Nov 2010 06:49:55 +0000 (14:49 +0800)]
KVM: MMU: fix apf prefault if nested guest is enabled

If apf is generated in L2 guest and is completed in L1 guest, it will
prefault this apf in L1 guest's mmu context.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: MMU: support apf for nonpaing guest
Xiao Guangrong [Fri, 12 Nov 2010 06:49:11 +0000 (14:49 +0800)]
KVM: MMU: support apf for nonpaing guest

Let's support apf for nonpaing guest

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: MMU: clear apfs if page state is changed
Xiao Guangrong [Fri, 12 Nov 2010 06:47:01 +0000 (14:47 +0800)]
KVM: MMU: clear apfs if page state is changed

If CR0.PG is changed, the page fault cann't be avoid when the prefault address
is accessed later

And it also fix a bug: it can retry a page enabled #PF in page disabled context
if mmu is shadow page

This idear is from Gleb Natapov

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: MMU: fix missing post sync audit
Xiao Guangrong [Fri, 12 Nov 2010 06:46:08 +0000 (14:46 +0800)]
KVM: MMU: fix missing post sync audit

Add AUDIT_POST_SYNC audit for long mode shadow page

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: Clean up vm creation and release
Jan Kiszka [Tue, 9 Nov 2010 16:02:49 +0000 (17:02 +0100)]
KVM: Clean up vm creation and release

IA64 support forces us to abstract the allocation of the kvm structure.
But instead of mixing this up with arch-specific initialization and
doing the same on destruction, split both steps. This allows to move
generic destruction calls into generic code.

It also fixes error clean-up on failures of kvm_create_vm for IA64.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: x86: Makefile clean up
Tracey Dent [Sat, 6 Nov 2010 18:52:58 +0000 (14:52 -0400)]
KVM: x86: Makefile clean up

Changed makefile to use the ccflags-y option instead of EXTRA_CFLAGS.

Signed-off-by: Tracey Dent <tdent48227@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: remove unused function declaration
Xiao Guangrong [Thu, 4 Nov 2010 10:29:42 +0000 (18:29 +0800)]
KVM: remove unused function declaration

Remove the declaration of kvm_mmu_set_base_ptes()

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: Refactor srcu struct release on early errors
Jan Kiszka [Tue, 9 Nov 2010 11:42:12 +0000 (12:42 +0100)]
KVM: Refactor srcu struct release on early errors

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: VMX: Disallow NMI while blocked by STI
Avi Kivity [Mon, 1 Nov 2010 21:20:48 +0000 (23:20 +0200)]
KVM: VMX: Disallow NMI while blocked by STI

While not mandated by the spec, Linux relies on NMI being blocked by an
IF-enabling STI.  VMX also refuses to enter a guest in this state, at
least on some implementations.

Disallow NMI while blocked by STI by checking for the condition, and
requesting an interrupt window exit if it occurs.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: fix the race while wakeup all pv guest
Xiao Guangrong [Mon, 1 Nov 2010 09:03:44 +0000 (17:03 +0800)]
KVM: fix the race while wakeup all pv guest

In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu->async_pf.done
without holding vcpu->async_pf.lock, it will break if we are handling apfs
at this time.

Also use 'list_empty_careful()' instead of 'list_empty()'

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: handle more completed apfs if possible
Xiao Guangrong [Tue, 2 Nov 2010 09:35:35 +0000 (17:35 +0800)]
KVM: handle more completed apfs if possible

If it's no need to inject async #PF to PV guest we can handle
more completed apfs at one time, so we can retry guest #PF
as early as possible

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: avoid unnecessary wait for a async pf
Xiao Guangrong [Mon, 1 Nov 2010 09:01:28 +0000 (17:01 +0800)]
KVM: avoid unnecessary wait for a async pf

In current code, it checks async pf completion out of the wait context,
like this:

if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
    !vcpu->arch.apf.halted)
r = vcpu_enter_guest(vcpu);
else {
......
kvm_vcpu_block(vcpu)
 ^- waiting until 'async_pf.done' is not empty
}

kvm_check_async_pf_completion(vcpu)
 ^- delete list from async_pf.done

So, if we check aysnc pf completion first, it can be blocked at
kvm_vcpu_block

Fixed by mark the vcpu is unhalted in kvm_check_async_pf_completion()
path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: fix searching async gfn in kvm_async_pf_gfn_slot
Xiao Guangrong [Mon, 1 Nov 2010 09:00:30 +0000 (17:00 +0800)]
KVM: fix searching async gfn in kvm_async_pf_gfn_slot

Don't search later slots if the slot is empty

Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: cleanup async_pf tracepoints
Xiao Guangrong [Mon, 1 Nov 2010 08:59:39 +0000 (16:59 +0800)]
KVM: cleanup async_pf tracepoints

Use 'DECLARE_EVENT_CLASS' to cleanup async_pf tracepoints

Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: fix tracing kvm_try_async_get_page
Xiao Guangrong [Mon, 1 Nov 2010 08:58:43 +0000 (16:58 +0800)]
KVM: fix tracing kvm_try_async_get_page

Tracing 'async' and *pfn is useless, since 'async' is always true,
and '*pfn' is always "fault_pfn'

We can trace 'gva' and 'gfn' instead, it can help us to see the
life-cycle of an async_pf

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: replace vmalloc and memset with vzalloc
Takuya Yoshikawa [Tue, 2 Nov 2010 01:49:34 +0000 (10:49 +0900)]
KVM: replace vmalloc and memset with vzalloc

Let's use newly introduced vzalloc().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: handle exit due to INVD in VMX
Gleb Natapov [Mon, 1 Nov 2010 13:35:01 +0000 (15:35 +0200)]
KVM: handle exit due to INVD in VMX

Currently the exit is unhandled, so guest halts with error if it tries
to execute INVD instruction. Call into emulator when INVD instruction
is executed by a guest instead. This instruction is not needed by ordinary
guests, but firmware (like OpenBIOS) use it and fail.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86: Avoid issuing wbinvd twice
Jan Kiszka [Mon, 1 Nov 2010 13:01:29 +0000 (14:01 +0100)]
KVM: x86: Avoid issuing wbinvd twice

Micro optimization to avoid calling wbinvd twice on the CPU that has to
emulate it. As we might be preempted between smp_call_function_many and
the local wbinvd, the cache might be filled again so that real work
could be done uselessly.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: get rid of warning within kvm_dev_ioctl_create_vm
Heiko Carstens [Wed, 27 Oct 2010 15:22:10 +0000 (17:22 +0200)]
KVM: get rid of warning within kvm_dev_ioctl_create_vm

Fixes this:

  CC      arch/s390/kvm/../../../virt/kvm/kvm_main.o
arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_dev_ioctl_create_vm':
arch/s390/kvm/../../../virt/kvm/kvm_main.c:1828:10: warning: unused variable 'r'

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: add cast within kvm_clear_guest_page to fix warning
Heiko Carstens [Wed, 27 Oct 2010 15:21:21 +0000 (17:21 +0200)]
KVM: add cast within kvm_clear_guest_page to fix warning

Fixes this:

  CC      arch/s390/kvm/../../../virt/kvm/kvm_main.o
arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_clear_guest_page':
arch/s390/kvm/../../../virt/kvm/kvm_main.c:1224:2: warning: passing argument 3 of 'kvm_write_guest_page' makes pointer from integer without a cast
arch/s390/kvm/../../../virt/kvm/kvm_main.c:1185:5: note: expected 'const void *' but argument is of type 'long unsigned int'

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: use kmalloc() for small dirty bitmaps
Takuya Yoshikawa [Mon, 1 Nov 2010 05:36:09 +0000 (14:36 +0900)]
KVM: use kmalloc() for small dirty bitmaps

Currently we are using vmalloc() for all dirty bitmaps even if
they are small enough, say less than K bytes.

We use kmalloc() if dirty bitmap size is less than or equal to
PAGE_SIZE so that we can avoid vmalloc area usage for VGA.

This will also make the logging start/stop faster.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: pre-allocate one more dirty bitmap to avoid vmalloc()
Takuya Yoshikawa [Wed, 27 Oct 2010 09:23:54 +0000 (18:23 +0900)]
KVM: pre-allocate one more dirty bitmap to avoid vmalloc()

Currently x86's kvm_vm_ioctl_get_dirty_log() needs to allocate a bitmap by
vmalloc() which will be used in the next logging and this has been causing
bad effect to VGA and live-migration: vmalloc() consumes extra systime,
triggers tlb flush, etc.

This patch resolves this issue by pre-allocating one more bitmap and switching
between two bitmaps during dirty logging.

Performance improvement:
  I measured performance for the case of VGA update by trace-cmd.
  The result was 1.5 times faster than the original one.

  In the case of live migration, the improvement ratio depends on the workload
  and the guest memory size. In general, the larger the memory size is the more
  benefits we get.

Note:
  This does not change other architectures's logic but the allocation size
  becomes twice. This will increase the actual memory consumption only when
  the new size changes the number of pages allocated by vmalloc().

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: introduce wrapper functions for creating/destroying dirty bitmaps
Takuya Yoshikawa [Wed, 27 Oct 2010 09:22:19 +0000 (18:22 +0900)]
KVM: introduce wrapper functions for creating/destroying dirty bitmaps

This makes it easy to change the way of allocating/freeing dirty bitmaps.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86: trace "exit to userspace" event
Gleb Natapov [Sun, 24 Oct 2010 14:49:08 +0000 (16:49 +0200)]
KVM: x86: trace "exit to userspace" event

Add tracepoint for userspace exit.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: propagate fault r/w information to gup(), allow read-only memory
Marcelo Tosatti [Fri, 22 Oct 2010 16:18:18 +0000 (14:18 -0200)]
KVM: propagate fault r/w information to gup(), allow read-only memory

As suggested by Andrea, pass r/w error code to gup(), upgrading read fault
to writable if host pte allows it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: MMU: flush TLBs on writable -> read-only spte overwrite
Marcelo Tosatti [Fri, 22 Oct 2010 16:18:17 +0000 (14:18 -0200)]
KVM: MMU: flush TLBs on writable -> read-only spte overwrite

This can happen in the following scenario:

vcpu0 vcpu1
read fault
gup(.write=0)
gup(.write=1)
reuse swap cache, no COW
set writable spte
use writable spte
set read-only spte

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: MMU: remove kvm_mmu_set_base_ptes
Marcelo Tosatti [Fri, 22 Oct 2010 16:18:16 +0000 (14:18 -0200)]
KVM: MMU: remove kvm_mmu_set_base_ptes

Unused.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: VMX: remove setting of shadow_base_ptes for EPT
Marcelo Tosatti [Fri, 22 Oct 2010 16:18:15 +0000 (14:18 -0200)]
KVM: VMX: remove setting of shadow_base_ptes for EPT

The EPT present/writable bits use the same position as normal
pagetable bits.

Since direct_map passes ACC_ALL to mmu_set_spte, thus always setting
the writable bit on sptes, use the generic PT_PRESENT shadow_base_pte.

Also pass present/writable error code information from EPT violation
to generic pagefault handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: Avoid double interrupt injection with vapic
Avi Kivity [Mon, 25 Oct 2010 13:23:55 +0000 (15:23 +0200)]
KVM: Avoid double interrupt injection with vapic

After an interrupt injection, the PPR changes, and we have to reflect that
into the vapic.  This causes a KVM_REQ_EVENT to be set, which causes the
whole interrupt injection routine to be run again (harmlessly).

Optimize by only setting KVM_REQ_EVENT if the ppr was lowered; otherwise
there is no chance that a new injection is needed.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: SVM: Fold save_host_msrs() and load_host_msrs() into their callers
Avi Kivity [Thu, 21 Oct 2010 10:20:34 +0000 (12:20 +0200)]
KVM: SVM: Fold save_host_msrs() and load_host_msrs() into their callers

This abstraction only serves to obfuscate.  Remove.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: SVM: Move fs/gs/ldt save/restore to heavyweight exit path
Avi Kivity [Thu, 21 Oct 2010 10:20:33 +0000 (12:20 +0200)]
KVM: SVM: Move fs/gs/ldt save/restore to heavyweight exit path

ldt is never used in the kernel context; same goes for fs (x86_64) and gs
(i386).  So save/restore them in the heavyweight exit path instead
of the lightweight path.

By itself, this doesn't buy us much, but it paves the way for moving vmload
and vmsave to the heavyweight exit path, since they modify the same registers.

[jan: fix copy/pase mistake on i386]

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: SVM: Move svm->host_gs_base into a separate structure
Avi Kivity [Thu, 21 Oct 2010 10:20:32 +0000 (12:20 +0200)]
KVM: SVM: Move svm->host_gs_base into a separate structure

More members will join it soon.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: SVM: Move guest register save out of interrupts disabled section
Avi Kivity [Thu, 21 Oct 2010 10:20:31 +0000 (12:20 +0200)]
KVM: SVM: Move guest register save out of interrupts disabled section

Saving guest registers is just a memory copy, and does not need to be in the
critical section.  Move outside the critical section to improve latency a
bit.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86: Add missing inline tag to kvm_read_and_reset_pf_reason
Jan Kiszka [Wed, 20 Oct 2010 16:34:54 +0000 (18:34 +0200)]
KVM: x86: Add missing inline tag to kvm_read_and_reset_pf_reason

May otherwise generates build warnings about unused
kvm_read_and_reset_pf_reason if included without CONFIG_KVM_GUEST
enabled.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Move KVM context switch into own function
Andi Kleen [Wed, 20 Oct 2010 15:56:17 +0000 (17:56 +0200)]
KVM: Move KVM context switch into own function

gcc 4.5 with some special options is able to duplicate the VMX
context switch asm in vmx_vcpu_run(). This results in a compile error
because the inline asm sequence uses an on local label. The non local
label is needed because other code wants to set up the return address.

This patch moves the asm code into an own function and marks
that explicitely noinline to avoid this problem.

Better would be probably to just move it into an .S file.

The diff looks worse than the change really is, it's all just
code movement and no logic change.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: x86: Mark kvm_arch_setup_async_pf static
Jan Kiszka [Wed, 20 Oct 2010 13:18:02 +0000 (15:18 +0200)]
KVM: x86: Mark kvm_arch_setup_async_pf static

It has no user outside mmu.c and also no prototype.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: improve hva_to_pfn() readability
Gleb Natapov [Tue, 19 Oct 2010 16:13:41 +0000 (18:13 +0200)]
KVM: improve hva_to_pfn() readability

Improve vma handling code readability in hva_to_pfn() and fix
async pf handling code to properly check vma returned by find_vma().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Send async PF when guest is not in userspace too.
Gleb Natapov [Thu, 14 Oct 2010 09:22:56 +0000 (11:22 +0200)]
KVM: Send async PF when guest is not in userspace too.

If guest indicates that it can handle async pf in kernel mode too send
it, but only if interrupts are enabled.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Let host know whether the guest can handle async PF in non-userspace context.
Gleb Natapov [Thu, 14 Oct 2010 09:22:55 +0000 (11:22 +0200)]
KVM: Let host know whether the guest can handle async PF in non-userspace context.

If guest can detect that it runs in non-preemptable context it can
handle async PFs at any time, so let host know that it can send async
PF even if guest cpu is not in userspace.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM paravirt: Handle async PF in non preemptable context
Gleb Natapov [Thu, 14 Oct 2010 09:22:54 +0000 (11:22 +0200)]
KVM paravirt: Handle async PF in non preemptable context

If async page fault is received by idle task or when preemp_count is
not zero guest cannot reschedule, so do sti; hlt and wait for page to be
ready. vcpu can still process interrupts while it waits for the page to
be ready.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Inject asynchronous page fault into a PV guest if page is swapped out.
Gleb Natapov [Thu, 14 Oct 2010 09:22:53 +0000 (11:22 +0200)]
KVM: Inject asynchronous page fault into a PV guest if page is swapped out.

Send async page fault to a PV guest if it accesses swapped out memory.
Guest will choose another task to run upon receiving the fault.

Allow async page fault injection only when guest is in user mode since
otherwise guest may be in non-sleepable context and will not be able
to reschedule.

Vcpu will be halted if guest will fault on the same page again or if
vcpu executes kernel code.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Handle async PF in a guest.
Gleb Natapov [Thu, 14 Oct 2010 09:22:52 +0000 (11:22 +0200)]
KVM: Handle async PF in a guest.

When async PF capability is detected hook up special page fault handler
that will handle async page fault events and bypass other page faults to
regular page fault handler. Also add async PF handling to nested SVM
emulation. Async PF always generates exit to L1 where vcpu thread will
be scheduled out until page is available.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM paravirt: Add async PF initialization to PV guest.
Gleb Natapov [Thu, 14 Oct 2010 09:22:51 +0000 (11:22 +0200)]
KVM paravirt: Add async PF initialization to PV guest.

Enable async PF in a guest if async PF capability is discovered.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Add PV MSR to enable asynchronous page faults delivery.
Gleb Natapov [Thu, 14 Oct 2010 09:22:50 +0000 (11:22 +0200)]
KVM: Add PV MSR to enable asynchronous page faults delivery.

Guest enables async PF vcpu functionality using this MSR.

Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM paravirt: Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.
Gleb Natapov [Thu, 14 Oct 2010 09:22:49 +0000 (11:22 +0200)]
KVM paravirt: Move kvm_smp_prepare_boot_cpu() from kvmclock.c to kvm.c.

Async PF also needs to hook into smp_prepare_boot_cpu so move the hook
into generic code.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Add memory slot versioning and use it to provide fast guest write interface
Gleb Natapov [Mon, 18 Oct 2010 13:22:23 +0000 (15:22 +0200)]
KVM: Add memory slot versioning and use it to provide fast guest write interface

Keep track of memslots changes by keeping generation number in memslots
structure. Provide kvm_write_guest_cached() function that skips
gfn_to_hva() translation if memslots was not changed since previous
invocation.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Retry fault before vmentry
Gleb Natapov [Sun, 17 Oct 2010 16:13:42 +0000 (18:13 +0200)]
KVM: Retry fault before vmentry

When page is swapped in it is mapped into guest memory only after guest
tries to access it again and generate another fault. To save this fault
we can map it immediately since we know that guest is going to access
the page. Do it only when tdp is enabled for now. Shadow paging case is
more complicated. CR[034] and EFER registers should be switched before
doing mapping and then switched back.

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Halt vcpu if page it tries to access is swapped out
Gleb Natapov [Thu, 14 Oct 2010 09:22:46 +0000 (11:22 +0200)]
KVM: Halt vcpu if page it tries to access is swapped out

If a guest accesses swapped out memory do not swap it in from vcpu thread
context. Schedule work to do swapping and put vcpu into halted state
instead.

Interrupts will still be delivered to the guest and if interrupt will
cause reschedule guest will continue to run another task.

[avi: remove call to get_user_pages_noio(), nacked by Linus; this
      makes everything synchrnous again]

Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
14 years agoKVM: Don't reset mmu context unnecessarily when updating EFER
Avi Kivity [Mon, 11 Oct 2010 12:23:39 +0000 (14:23 +0200)]
KVM: Don't reset mmu context unnecessarily when updating EFER

The only bit of EFER that affects the mmu is NX, and this is already
accounted for (LME only takes effect when changing cr0).

Based on a patch by Hillf Danton.

Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: i8259: initialize isr_ack
Avi Kivity [Fri, 31 Dec 2010 08:52:15 +0000 (10:52 +0200)]
KVM: i8259: initialize isr_ack

isr_ack is never initialized.  So, until the first PIC reset, interrupts
may fail to be injected.  This can cause Windows XP to fail to boot, as
reported in the fallout from the fix to
https://bugzilla.kernel.org/show_bug.cgi?id=21962.

Reported-and-tested-by: Nicolas Prochazka <prochazka.nicolas@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoKVM: MMU: Fix incorrect direct gfn for unpaged mode shadow
Avi Kivity [Tue, 28 Dec 2010 10:09:07 +0000 (12:09 +0200)]
KVM: MMU: Fix incorrect direct gfn for unpaged mode shadow

We use the physical address instead of the base gfn for the four
PAE page directories we use in unpaged mode.  When the guest accesses
an address above 1GB that is backed by a large host page, a BUG_ON()
in kvm_mmu_set_gfn() triggers.

Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=21962
Reported-and-tested-by: Nicolas Prochazka <prochazka.nicolas@gmail.com>
KVM-Stable-Tag.
Signed-off-by: Avi Kivity <avi@redhat.com>
14 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile
Linus Torvalds [Sat, 18 Dec 2010 18:28:54 +0000 (10:28 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile

* git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile:
  arch/tile: handle rt_sigreturn() more cleanly
  arch/tile: handle CLONE_SETTLS in copy_thread(), not user space

14 years agoMerge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus
Linus Torvalds [Sat, 18 Dec 2010 18:23:29 +0000 (10:23 -0800)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus

* 'upstream' of git://git.linux-mips.org/pub/scm/upstream-linus:
  MIPS: Fix build errors in sc-mips.c

14 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes...
Linus Torvalds [Sat, 18 Dec 2010 18:13:24 +0000 (10:13 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
  x86: avoid high BIOS area when allocating address space
  x86: avoid E820 regions when allocating address space
  x86: avoid low BIOS area when allocating address space
  resources: add arch hook for preventing allocation in reserved areas
  Revert "resources: support allocating space within a region from the top down"
  Revert "PCI: allocate bus resources from the top down"
  Revert "x86/PCI: allocate space from the end of a region, not the beginning"
  Revert "x86: allocate space within a region top-down"
  Revert "PCI: fix pci_bus_alloc_resource() hang, prefer positive decode"
  PCI: Update MCP55 quirk to not affect non HyperTransport variants

14 years agoarch/tile: handle rt_sigreturn() more cleanly
Chris Metcalf [Tue, 14 Dec 2010 21:07:25 +0000 (16:07 -0500)]
arch/tile: handle rt_sigreturn() more cleanly

The current tile rt_sigreturn() syscall pattern uses the common idiom
of loading up pt_regs with all the saved registers from the time of
the signal, then anticipating the fact that we will clobber the ABI
"return value" register (r0) as we return from the syscall by setting
the rt_sigreturn return value to whatever random value was in the pt_regs
for r0.

However, this breaks in our 64-bit kernel when running "compat" tasks,
since we always sign-extend the "return value" register to properly
handle returned pointers that are in the upper 2GB of the 32-bit compat
address space.  Doing this to the sigreturn path then causes occasional
random corruption of the 64-bit r0 register.

Instead, we stop doing the crazy "load the return-value register"
hack in sigreturn.  We already have some sigreturn-specific assembly
code that we use to pass the pt_regs pointer to C code.  We extend that
code to also set the link register to point to a spot a few instructions
after the usual syscall return address so we don't clobber the saved r0.
Now it no longer matters what the rt_sigreturn syscall returns, and the
pt_regs structure can be cleanly and completely reloaded.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
14 years agoarch/tile: handle CLONE_SETTLS in copy_thread(), not user space
Chris Metcalf [Tue, 14 Dec 2010 20:57:49 +0000 (15:57 -0500)]
arch/tile: handle CLONE_SETTLS in copy_thread(), not user space

Previously we were just setting up the "tp" register in the
new task as started by clone() in libc.  However, this is not
quite right, since in principle a signal might be delivered to
the new task before it had its TLS set up.  (Of course, this race
window still exists for resetting the libc getpid() cached value
in the new task, in principle.  But in any case, we are now doing
this exactly the way all other architectures do it.)

This change is important for 2.6.37 since the tile glibc we will
be submitting upstream will not set TLS in user space any more,
so it will only work on a kernel that has this fix.  It should
also be taken for 2.6.36.x in the stable tree if possible.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
Cc: stable <stable@kernel.org>
14 years agoMIPS: Fix build errors in sc-mips.c
Kevin Cernekee [Wed, 3 Nov 2010 05:28:01 +0000 (22:28 -0700)]
MIPS: Fix build errors in sc-mips.c

Seen with malta_defconfig on Linus' tree:

  CC      arch/mips/mm/sc-mips.o
arch/mips/mm/sc-mips.c: In function 'mips_sc_is_activated':
arch/mips/mm/sc-mips.c:77: error: 'config2' undeclared (first use in this function)
arch/mips/mm/sc-mips.c:77: error: (Each undeclared identifier is reported only once
arch/mips/mm/sc-mips.c:77: error: for each function it appears in.)
arch/mips/mm/sc-mips.c:81: error: 'tmp' undeclared (first use in this function)
make[2]: *** [arch/mips/mm/sc-mips.o] Error 1
make[1]: *** [arch/mips/mm] Error 2
make: *** [arch/mips] Error 2

[Ralf: Cosmetic changes to minimize the number of arguments passed to
mips_sc_is_activated]

Signed-off-by: Kevin Cernekee <cernekee@gmail.com>
Patchwork: https://patchwork.linux-mips.org/patch/1752/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
14 years agox86: avoid high BIOS area when allocating address space
Bjorn Helgaas [Thu, 16 Dec 2010 17:39:02 +0000 (10:39 -0700)]
x86: avoid high BIOS area when allocating address space

This prevents allocation of the last 2MB before 4GB.

The experiment described here shows Windows 7 ignoring the last 1MB:
https://bugzilla.kernel.org/show_bug.cgi?id=23542#c27

This patch ignores the top 2MB instead of just 1MB because H. Peter Anvin
says "There will be ROM at the top of the 32-bit address space; it's a fact
of the architecture, and on at least older systems it was common to have a
shadow 1 MiB below."

Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
14 years agox86: avoid E820 regions when allocating address space
Bjorn Helgaas [Thu, 16 Dec 2010 17:38:56 +0000 (10:38 -0700)]
x86: avoid E820 regions when allocating address space

When we allocate address space, e.g., to assign it to a PCI device, don't
allocate anything mentioned in the BIOS E820 memory map.

On recent machines (2008 and newer), we assign PCI resources from the
windows described by the ACPI PCI host bridge _CRS.  On many Dell
machines, these windows overlap some E820 reserved areas, e.g.,

    BIOS-e820: 00000000bfe4dc00 - 00000000c0000000 (reserved)
    pci_root PNP0A03:00: host bridge window [mem 0xbff00000-0xdfffffff]

If we put devices at 0xbff00000, they don't work, probably because
that's really RAM, not I/O memory.  This patch prevents that by removing
the 0xbfe4dc00-0xbfffffff area from the "available" resource.

I'm not very happy with this solution because Windows solves the problem
differently (it seems to ignore E820 reserved areas and it allocates
top-down instead of bottom-up; details at comment 45 of the bugzilla
below).  That means we're vulnerable to BIOS defects that Windows would not
trip over.  For example, if BIOS described a device in ACPI but didn't
mention it in E820, Windows would work fine but Linux would fail.

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=16228
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
14 years agox86: avoid low BIOS area when allocating address space
Bjorn Helgaas [Thu, 16 Dec 2010 17:38:51 +0000 (10:38 -0700)]
x86: avoid low BIOS area when allocating address space

This implements arch_remove_reservations() so allocate_resource() can
avoid any arch-specific reserved areas.  This currently just avoids the
BIOS area (the first 1MB), but could be used for E820 reserved areas if
that turns out to be necessary.

We previously avoided this area in pcibios_align_resource().  This patch
moves the test from that PCI-specific path to a generic path, so *all*
resource allocations will avoid this area.

Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
14 years agoresources: add arch hook for preventing allocation in reserved areas
Bjorn Helgaas [Thu, 16 Dec 2010 17:38:46 +0000 (10:38 -0700)]
resources: add arch hook for preventing allocation in reserved areas

This adds arch_remove_reservations(), which an arch can implement if it
needs to protect part of the address space from allocation.

Sometimes that can be done by just putting a region in the resource tree,
but there are cases where that doesn't work well.  For example, x86 BIOS
E820 reservations are not related to devices, so they may overlap part of,
all of, or more than a device resource, so they may not end up at the
correct spot in the resource tree.

Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>