KVM: X86: Implement ring-based dirty memory tracking

author Peter Xu <peterx@redhat.com>

Thu, 1 Oct 2020 01:22:22 +0000 (21:22 -0400)

committer Paolo Bonzini <pbonzini@redhat.com>

Sun, 15 Nov 2020 14:49:15 +0000 (09:49 -0500)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 81d54fe76a2d50986ea7d5930f94fe9f29f063d2..e264ebc35e27a18474d7ab43fcfbb38b570ad322 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -262,6 +262,18 @@ The KVM_RUN ioctl (cf.) communicates with userspace via a shared
  memory region.  This ioctl returns the size of that region.  See the
  KVM_RUN documentation for details.
  
+Besides the size of the KVM_RUN communication region, other areas of
+the VCPU file descriptor can be mmap-ed, including:
+
+- if KVM_CAP_COALESCED_MMIO is available, a page at
+  KVM_COALESCED_MMIO_PAGE_OFFSET * PAGE_SIZE; for historical reasons,
+  this page is included in the result of KVM_GET_VCPU_MMAP_SIZE.
+  KVM_CAP_COALESCED_MMIO is not documented yet.
+
+- if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
+  KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE.  For more information on
+  KVM_CAP_DIRTY_LOG_RING, see section 8.3.
+
  
  4.6 KVM_SET_MEMORY_REGION
  -------------------------
@@ -6396,3 +6408,84 @@ When enabled, KVM will disable paravirtual features provided to the
  guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
  (0x40000001). Otherwise, a guest may use the paravirtual features
  regardless of what has actually been exposed through the CPUID leaf.
+
+
+8.29 KVM_CAP_DIRTY_LOG_RING
+---------------------------
+
+:Architectures: x86
+:Parameters: args[0] - size of the dirty log ring, in bytes
+
+KVM is capable of tracking dirty memory using ring buffers that are
+mmaped into userspace; there is one dirty ring per vcpu.
+
+The dirty ring is available to userspace as an array of
+``struct kvm_dirty_gfn``.  Each dirty entry is defined as::
+
+  struct kvm_dirty_gfn {
+          __u32 flags;
+          __u32 slot; /* as_id | slot_id */
+          __u64 offset;
+  };
+
+The following values are defined for the flags field to define the
+current state of the entry::
+
+  #define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
+  #define KVM_DIRTY_GFN_F_RESET           BIT(1)
+  #define KVM_DIRTY_GFN_F_MASK            0x3
+
+Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
+ioctl to enable this capability for the new guest and set the size of
+the rings.  Enabling the capability is only allowed before creating any
+vCPU, and the size of the ring must be a power of two.  The larger the
+ring buffer, the less likely the ring is to fill up and force the VM to
+exit to userspace.  The optimal size depends on the workload, but it is
+recommended that it be at least 64 KiB (4096 entries).
+
+Just like for dirty page bitmaps, the buffer tracks writes to
+all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
+set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
+with the flag set, userspace can start harvesting dirty pages from the
+ring buffer.
+
+An entry in the ring buffer can be unused (flag bits ``00``),
+dirty (flag bits ``01``) or harvested (flag bits ``1X``).  The
+state machine for the entry is as follows::
+
+          dirtied         harvested        reset
+     00 -----------> 01 -------------> 1X -------+
+      ^                                          |
+      |                                          |
+      +------------------------------------------+
+
+To harvest the dirty pages, userspace accesses the mmaped ring buffer
+to read the dirty GFNs.  If an entry's flags have the DIRTY bit set (at
+this stage the RESET bit must be clear), the entry holds a dirty GFN.
+Userspace should harvest this GFN and change the flags from state
+``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
+to show that this GFN is harvested and waiting for a reset), and move
+on to the next GFN.  Userspace should continue to do this until it finds
+an entry whose flags have the DIRTY bit clear, meaning that it has
+harvested all the dirty GFNs that were available.
+
+It's not necessary for userspace to harvest all the dirty GFNs at once.
+However, it must collect the dirty GFNs in sequence, i.e., the userspace
+program cannot skip one dirty GFN to collect the one next to it.
+
+After processing one or more entries in the ring buffer, userspace
+calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
+it, so that the kernel will reprotect those collected GFNs.
+Therefore, the ioctl must be called *before* reading the content of
+the dirty pages.
+
+The dirty ring can get full.  When that happens, the KVM_RUN of the
+vcpu will return with exit reason KVM_EXIT_DIRTY_RING_FULL.
+
+The dirty ring interface has a major difference compared to the
+KVM_GET_DIRTY_LOG interface: when userspace reads the dirty ring, the
+kernel may not yet have flushed the processor's dirty page buffers
+into the kernel buffer (with dirty bitmaps, the flushing is done by
+the KVM_GET_DIRTY_LOG ioctl).  To force the flush, userspace needs
+to kick the vcpu out of KVM_RUN using a signal.  The resulting
+vmexit ensures that all dirty GFNs are flushed to the dirty rings.
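The harvesting protocol documented above can be sketched from the userspace side roughly as follows. This is an illustrative sketch, not code from the patch: `struct dirty_gfn` and the `DIRTY_GFN_F_*` macros are local mirrors of the uapi definitions, `harvest_ring()` and its `fetch_index` cursor are hypothetical names, and the GCC/Clang `__atomic` builtins are assumed as the way to order the flag accesses against the kernel's `smp_wmb()`.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirrors of the uapi definitions added by this patch, so that
 * the sketch builds without kernel headers. */
#define DIRTY_GFN_F_DIRTY (1u << 0)
#define DIRTY_GFN_F_RESET (1u << 1)

struct dirty_gfn {
	uint32_t flags;
	uint32_t slot;   /* as_id | slot_id */
	uint64_t offset;
};

/*
 * Harvest entries strictly in sequence, starting at *fetch_index (a
 * free-running cursor kept by userspace; hypothetical name).  Each
 * harvested entry is copied to out[] and moved from state 01 to 1X.
 * Returns the number of entries harvested.
 */
static uint32_t harvest_ring(struct dirty_gfn *ring, uint32_t ring_entries,
			     uint32_t *fetch_index,
			     struct dirty_gfn *out, uint32_t out_cap)
{
	uint32_t count = 0;

	/* The ring size is a power of two, so masking gives the slot. */
	assert(ring_entries && !(ring_entries & (ring_entries - 1)));

	while (count < out_cap && count < ring_entries) {
		struct dirty_gfn *e = &ring[*fetch_index & (ring_entries - 1)];
		/* Read flags first; acquire so slot/offset are read after. */
		uint32_t flags = __atomic_load_n(&e->flags, __ATOMIC_ACQUIRE);

		/* Stop at the first entry that is not in the dirty state. */
		if ((flags & (DIRTY_GFN_F_DIRTY | DIRTY_GFN_F_RESET)) !=
		    DIRTY_GFN_F_DIRTY)
			break;

		out[count].flags = flags;
		out[count].slot = e->slot;
		out[count].offset = e->offset;

		/* Mark the entry harvested; KVM ignores bit 0 from here on. */
		__atomic_store_n(&e->flags, DIRTY_GFN_F_RESET, __ATOMIC_RELEASE);
		(*fetch_index)++;
		count++;
	}
	return count;
}
```

After a pass like this, userspace would issue the KVM_RESET_DIRTY_RINGS VM ioctl so that the kernel reprotects the harvested GFNs.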
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 69e94aa716e94dd5b03b7898b18dacc1eaad47d7..f002cdb13a0b15dd62370e88665dce22dfe8dc8f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1232,6 +1232,7 @@ struct kvm_x86_ops {
         void (*enable_log_dirty_pt_masked)(struct kvm *kvm,
                                            struct kvm_memory_slot *slot,
                                            gfn_t offset, unsigned long mask);
+       int (*cpu_dirty_log_size)(void);
  
         /* pmu operations of sub-arch */
         const struct kvm_pmu_ops *pmu_ops;
@@ -1744,4 +1745,6 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
  #define GET_SMSTATE(type, buf, offset)         \
         (*(type *)((buf) + (offset) - 0x7e00))
  
+int kvm_cpu_dirty_log_size(void);
+
  #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 89e5f3d1bba86b8bbb0e7a80cbed2d1373d093af..8e76d3701db3f2432c91c9bd32623202547593b1 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -12,6 +12,7 @@
  
  #define KVM_PIO_PAGE_OFFSET 1
  #define KVM_COALESCED_MMIO_PAGE_OFFSET 2
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
  
  #define DE_VECTOR 0
  #define DB_VECTOR 1
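Since the ring is mapped at KVM_DIRTY_LOG_PAGE_OFFSET pages into the vcpu file descriptor, a userspace mapping could look roughly like the sketch below. The helper names `dirty_ring_mmap_offset()` and `map_dirty_ring()` are hypothetical, and `DIRTY_LOG_PAGE_OFFSET` is a local mirror of the x86 value defined above; only `mmap()` and `sysconf()` are real library calls.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Local mirror of KVM_DIRTY_LOG_PAGE_OFFSET as defined for x86. */
#define DIRTY_LOG_PAGE_OFFSET 64

/* Byte offset of the dirty ring within the vcpu fd mapping. */
static off_t dirty_ring_mmap_offset(long page_size)
{
	assert(page_size > 0);
	return (off_t)DIRTY_LOG_PAGE_OFFSET * (off_t)page_size;
}

/*
 * Map ring_bytes of the dirty ring of one vcpu.  Returns MAP_FAILED on
 * error, like mmap() itself.  The mapping is shared and non-executable.
 */
static void *map_dirty_ring(int vcpu_fd, size_t ring_bytes)
{
	long page_size = sysconf(_SC_PAGESIZE);

	return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
		    vcpu_fd, dirty_ring_mmap_offset(page_size));
}
```

The mapping is requested as MAP_SHARED without PROT_EXEC because `kvm_vcpu_mmap()` in this patch returns -EINVAL for dirty ring pages when the mapping is executable or not shared.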
diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index b804444e16d47d017db4d24c9a52872701199128..4bd14ab0132324b4fdc9a5ace9c438663a0d7a56 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -10,7 +10,8 @@ endif
  KVM := ../../../virt/kvm
  
  kvm-y                  += $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o \
-                               $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o
+                               $(KVM)/eventfd.o $(KVM)/irqchip.o $(KVM)/vfio.o \
+                               $(KVM)/dirty_ring.o
  kvm-$(CONFIG_KVM_ASYNC_PF)     += $(KVM)/async_pf.o
  
  kvm-y                  += x86.o emulate.o i8259.o irq.o lapic.o \
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5bb1939b65d8d910717218178a151e9af316bf3a..12e5cfe0995e092aadcb78df84446d63d1be7c5b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1289,6 +1289,14 @@ void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
                 kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
  }
  
+int kvm_cpu_dirty_log_size(void)
+{
+       if (kvm_x86_ops.cpu_dirty_log_size)
+               return kvm_x86_ops.cpu_dirty_log_size();
+
+       return 0;
+}
+
  bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
                                     struct kvm_memory_slot *slot, u64 gfn)
  {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ff28a5c6abd6340085f2a765a482fdac7a7d2ea1..cffa51c6049ee973ef2e9e180781e77e4be93025 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -185,7 +185,7 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
         if ((!is_writable_pte(old_spte) || pfn_changed) &&
             is_writable_pte(new_spte)) {
                 slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
-               mark_page_dirty_in_slot(slot, gfn);
+               mark_page_dirty_in_slot(kvm, slot, gfn);
         }
  }
  
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 46b32aa43811df5f19eb710c66189b680743b3ba..2b6d538454a61e6aa8fc8772cf733df35c3f779b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7583,6 +7583,11 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
         return supported & BIT(bit);
  }
  
+static int vmx_cpu_dirty_log_size(void)
+{
+       return enable_pml ? PML_ENTITY_NUM : 0;
+}
+
  static struct kvm_x86_ops vmx_x86_ops __initdata = {
         .hardware_unsetup = hardware_unsetup,
  
@@ -7712,6 +7717,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
         .migrate_timers = vmx_migrate_timers,
  
         .msr_filter_changed = vmx_msr_filter_changed,
+       .cpu_dirty_log_size = vmx_cpu_dirty_log_size,
  };
  
  static __init int hardware_setup(void)
@@ -7829,6 +7835,7 @@ static __init int hardware_setup(void)
                 vmx_x86_ops.slot_disable_log_dirty = NULL;
                 vmx_x86_ops.flush_log_dirty = NULL;
                 vmx_x86_ops.enable_log_dirty_pt_masked = NULL;
+               vmx_x86_ops.cpu_dirty_log_size = NULL;
         }
  
         if (!cpu_has_vmx_preemption_timer())
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b4ac726526f8a30e6033f83530d0d430ad71e71b..6c704a597b7caf63eeb7eb61435fe742781cf09d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8754,6 +8754,15 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
  
         bool req_immediate_exit = false;
  
+       /* Forbid vmenter if vcpu dirty ring is soft-full */
+       if (unlikely(vcpu->kvm->dirty_ring_size &&
+                    kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+               vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+               trace_kvm_dirty_ring_exit(vcpu);
+               r = 0;
+               goto out;
+       }
+
         if (kvm_request_pending(vcpu)) {
                 if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
                         if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
diff --git a/include/linux/kvm_dirty_ring.h b/include/linux/kvm_dirty_ring.h
new file mode 100644
index 0000000..120e5e9
--- /dev/null
+++ b/include/linux/kvm_dirty_ring.h
@@ -0,0 +1,103 @@
+#ifndef KVM_DIRTY_RING_H
+#define KVM_DIRTY_RING_H
+
+#include <linux/kvm.h>
+
+/**
+ * kvm_dirty_ring: KVM internal dirty ring structure
+ *
+ * @dirty_index: free running counter that points to the next slot in
+ *               dirty_ring->dirty_gfns, where a new dirty page should go
+ * @reset_index: free running counter that points to the next dirty page
+ *               in dirty_ring->dirty_gfns for which dirty trap needs to
+ *               be reenabled
+ * @size:        size of the compact list, dirty_ring->dirty_gfns
+ * @soft_limit:  when the number of dirty pages in the list reaches this
+ *               limit, vcpu that owns this ring should exit to userspace
+ *               to allow userspace to harvest all the dirty pages
+ * @dirty_gfns:  the array to keep the dirty gfns
+ * @index:       index of this dirty ring
+ */
+struct kvm_dirty_ring {
+       u32 dirty_index;
+       u32 reset_index;
+       u32 size;
+       u32 soft_limit;
+       struct kvm_dirty_gfn *dirty_gfns;
+       int index;
+};
+
+#if (KVM_DIRTY_LOG_PAGE_OFFSET == 0)
+/*
+ * If KVM_DIRTY_LOG_PAGE_OFFSET is not defined, kvm_dirty_ring.o should
+ * not be built either, so define these nop functions for the arch.
+ */
+static inline u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+       return 0;
+}
+
+static inline int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring,
+                                      int index, u32 size)
+{
+       return 0;
+}
+
+static inline struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+       return NULL;
+}
+
+static inline int kvm_dirty_ring_reset(struct kvm *kvm,
+                                      struct kvm_dirty_ring *ring)
+{
+       return 0;
+}
+
+static inline void kvm_dirty_ring_push(struct kvm_dirty_ring *ring,
+                                      u32 slot, u64 offset)
+{
+}
+
+static inline struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring,
+                                                  u32 offset)
+{
+       return NULL;
+}
+
+static inline void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+}
+
+static inline bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+       return true;
+}
+
+#else /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+
+u32 kvm_dirty_ring_get_rsvd_entries(void);
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size);
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm);
+
+/*
+ * called with kvm->slots_lock held, returns the number of
+ * processed pages.
+ */
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring);
+
+/*
+ * Pushes one dirty GFN onto the ring.  Returns nothing: the ring must
+ * never be full here, which the vcpu soft-full check is meant to ensure.
+ */
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset);
+
+/* for use in vm_operations_struct */
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset);
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring);
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring);
+
+#endif /* KVM_DIRTY_LOG_PAGE_OFFSET == 0 */
+
+#endif /* KVM_DIRTY_RING_H */
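The free-running `dirty_index` and `reset_index` counters described in the structure comment above rely on unsigned wraparound: the number of used entries is simply their difference modulo 2^32. A minimal standalone sketch of that arithmetic follows; `struct ring_state` and the `ring_*` helpers are hypothetical names that mirror the kernel fields, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the counters kept in struct kvm_dirty_ring. */
struct ring_state {
	uint32_t dirty_index;  /* next slot to fill, free running */
	uint32_t reset_index;  /* next slot to reset, free running */
	uint32_t size;         /* number of entries, a power of two */
	uint32_t soft_limit;   /* size minus the reserved entries */
};

static void ring_init(struct ring_state *r, uint32_t entries, uint32_t rsvd)
{
	assert(entries && !(entries & (entries - 1)) && rsvd < entries);
	r->dirty_index = 0;
	r->reset_index = 0;
	r->size = entries;
	r->soft_limit = entries - rsvd;
}

/* Unsigned subtraction stays correct even after the counters wrap. */
static uint32_t ring_used(const struct ring_state *r)
{
	return r->dirty_index - r->reset_index;
}

static bool ring_soft_full(const struct ring_state *r)
{
	return ring_used(r) >= r->soft_limit;
}

/* Array slot that a free-running counter currently refers to. */
static uint32_t ring_slot(const struct ring_state *r, uint32_t index)
{
	return index & (r->size - 1);
}
```

Because the counters are never reduced modulo the ring size, "empty" (difference 0) and "full" (difference equal to `size`) remain distinguishable without sacrificing a slot.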
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ca7c1459a8e3083f5d82a8f63bd1a9048d601a5c..864b156391c887a6644c0b451e067638d4239388 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -34,6 +34,7 @@
  #include <linux/kvm_types.h>
  
  #include <asm/kvm_host.h>
+#include <linux/kvm_dirty_ring.h>
  
  #ifndef KVM_MAX_VCPU_ID
  #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS
@@ -319,6 +320,7 @@ struct kvm_vcpu {
         bool preempted;
         bool ready;
         struct kvm_vcpu_arch arch;
+       struct kvm_dirty_ring dirty_ring;
  };
  
  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
@@ -505,6 +507,7 @@ struct kvm {
         struct srcu_struct irq_srcu;
         pid_t userspace_pid;
         unsigned int max_halt_poll_ns;
+       u32 dirty_ring_size;
  };
  
  #define kvm_err(fmt, ...) \
@@ -1477,4 +1480,14 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
  }
  #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
  
+/*
+ * This defines how many reserved entries we want to keep before we
+ * kick the vcpu out to userspace, so that the dirty ring never fills
+ * up.  This value can be tuned higher if e.g. PML is enabled on the host.
+ */
+#define  KVM_DIRTY_RING_RSVD_ENTRIES  64
+
+/* Max number of entries allowed for each kvm dirty ring */
+#define  KVM_DIRTY_RING_MAX_ENTRIES  65536
+
  #endif
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 26cfb0fa8e7ee780658d0f9a03c0deb7f22358a7..49d7d0fe29f6fb875d701215047686e5df82fbd7 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -399,6 +399,69 @@ TRACE_EVENT(kvm_halt_poll_ns,
  #define trace_kvm_halt_poll_ns_shrink(vcpu_id, new, old) \
         trace_kvm_halt_poll_ns(false, vcpu_id, new, old)
  
+TRACE_EVENT(kvm_dirty_ring_push,
+       TP_PROTO(struct kvm_dirty_ring *ring, u32 slot, u64 offset),
+       TP_ARGS(ring, slot, offset),
+
+       TP_STRUCT__entry(
+               __field(int, index)
+               __field(u32, dirty_index)
+               __field(u32, reset_index)
+               __field(u32, slot)
+               __field(u64, offset)
+       ),
+
+       TP_fast_assign(
+               __entry->index          = ring->index;
+               __entry->dirty_index    = ring->dirty_index;
+               __entry->reset_index    = ring->reset_index;
+               __entry->slot           = slot;
+               __entry->offset         = offset;
+       ),
+
+       TP_printk("ring %d: dirty 0x%x reset 0x%x "
+                 "slot %u offset 0x%llx (used %u)",
+                 __entry->index, __entry->dirty_index,
+                 __entry->reset_index,  __entry->slot, __entry->offset,
+                 __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_reset,
+       TP_PROTO(struct kvm_dirty_ring *ring),
+       TP_ARGS(ring),
+
+       TP_STRUCT__entry(
+               __field(int, index)
+               __field(u32, dirty_index)
+               __field(u32, reset_index)
+       ),
+
+       TP_fast_assign(
+               __entry->index          = ring->index;
+               __entry->dirty_index    = ring->dirty_index;
+               __entry->reset_index    = ring->reset_index;
+       ),
+
+       TP_printk("ring %d: dirty 0x%x reset 0x%x (used %u)",
+                 __entry->index, __entry->dirty_index, __entry->reset_index,
+                 __entry->dirty_index - __entry->reset_index)
+);
+
+TRACE_EVENT(kvm_dirty_ring_exit,
+       TP_PROTO(struct kvm_vcpu *vcpu),
+       TP_ARGS(vcpu),
+
+       TP_STRUCT__entry(
+           __field(int, vcpu_id)
+       ),
+
+       TP_fast_assign(
+           __entry->vcpu_id = vcpu->vcpu_id;
+       ),
+
+       TP_printk("vcpu %d", __entry->vcpu_id)
+);
+
  #endif /* _TRACE_KVM_MAIN_H */
  
  /* This part must be outside protection */
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 204afbe1240e39a644374db5e14e29ec47b54e2d..886802b8ffba3617f64b47f7f6f452335fdfa23e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -250,6 +250,7 @@ struct kvm_hyperv_exit {
  #define KVM_EXIT_ARM_NISV         28
  #define KVM_EXIT_X86_RDMSR        29
  #define KVM_EXIT_X86_WRMSR        30
+#define KVM_EXIT_DIRTY_RING_FULL  31
  
  /* For KVM_EXIT_INTERNAL_ERROR */
  /* Emulate instruction failed. */
@@ -1054,6 +1055,7 @@ struct kvm_ppc_resize_hpt {
  #define KVM_CAP_X86_MSR_FILTER 189
  #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
  #define KVM_CAP_SYS_HYPERV_CPUID 191
+#define KVM_CAP_DIRTY_LOG_RING 192
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
@@ -1558,6 +1560,9 @@ struct kvm_pv_cmd {
  /* Available with KVM_CAP_X86_MSR_FILTER */
  #define KVM_X86_SET_MSR_FILTER _IOW(KVMIO,  0xc6, struct kvm_msr_filter)
  
+/* Available with KVM_CAP_DIRTY_LOG_RING */
+#define KVM_RESET_DIRTY_RINGS          _IO(KVMIO, 0xc7)
+
  /* Secure Encrypted Virtualization command */
  enum sev_cmd_id {
         /* Guest initialization commands */
@@ -1711,4 +1716,52 @@ struct kvm_hyperv_eventfd {
  #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE    (1 << 0)
  #define KVM_DIRTY_LOG_INITIALLY_SET            (1 << 1)
  
+/*
+ * Arch needs to define the macro after implementing the dirty ring
+ * feature.  KVM_DIRTY_LOG_PAGE_OFFSET should be defined as the
+ * starting page offset of the dirty ring structures.
+ */
+#ifndef KVM_DIRTY_LOG_PAGE_OFFSET
+#define KVM_DIRTY_LOG_PAGE_OFFSET 0
+#endif
+
+/*
+ * KVM dirty GFN flags, defined as:
+ *
+ * |---------------+---------------+--------------|
+ * | bit 1 (reset) | bit 0 (dirty) | Status       |
+ * |---------------+---------------+--------------|
+ * |             0 |             0 | Invalid GFN  |
+ * |             0 |             1 | Dirty GFN    |
+ * |             1 |             X | GFN to reset |
+ * |---------------+---------------+--------------|
+ *
+ * Lifecycle of a dirty GFN goes like:
+ *
+ *      dirtied         harvested        reset
+ * 00 -----------> 01 -------------> 1X -------+
+ *  ^                                          |
+ *  |                                          |
+ *  +------------------------------------------+
+ *
+ * The userspace program is only responsible for the 01->1X state
+ * conversion after harvesting an entry.  Also, it must not skip any
+ * dirty bits, so that dirty bits are always harvested in sequence.
+ */
+#define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
+#define KVM_DIRTY_GFN_F_RESET           BIT(1)
+#define KVM_DIRTY_GFN_F_MASK            0x3
+
+/*
+ * KVM dirty rings should be mapped at KVM_DIRTY_LOG_PAGE_OFFSET of
+ * per-vcpu mmaped regions as an array of struct kvm_dirty_gfn.  The
+ * size of the gfn buffer is decided by the first argument when
+ * enabling KVM_CAP_DIRTY_LOG_RING.
+ */
+struct kvm_dirty_gfn {
+       __u32 flags;
+       __u32 slot;
+       __u64 offset;
+};
+
  #endif /* __LINUX_KVM_H */
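The `slot` field of `struct kvm_dirty_gfn` packs the address space id into the upper 16 bits and the memslot id into the lower 16 bits, matching the `(memslot->as_id << 16) | memslot->id` expression used by `mark_page_dirty_in_slot()` in this patch. A small sketch of encoding and decoding that field in userspace follows; the `dirty_slot_*` helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack an address space id and a memslot id the way the kernel does. */
static uint32_t dirty_slot_encode(uint16_t as_id, uint16_t slot_id)
{
	return ((uint32_t)as_id << 16) | slot_id;
}

/* Upper 16 bits: address space id. */
static uint16_t dirty_slot_as_id(uint32_t slot)
{
	return (uint16_t)(slot >> 16);
}

/* Lower 16 bits: memslot id within that address space. */
static uint16_t dirty_slot_id(uint32_t slot)
{
	return (uint16_t)slot;
}

/* Round-trip self check, usable in a debug build. */
static void dirty_slot_selfcheck(void)
{
	uint32_t slot = dirty_slot_encode(1, 5);

	assert(dirty_slot_as_id(slot) == 1);
	assert(dirty_slot_id(slot) == 5);
}
```

The `offset` field that accompanies `slot` is the GFN relative to the start of the memslot, so the guest frame number is the memslot's base GFN plus `offset`.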
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
new file mode 100644
index 0000000..9d01299
--- /dev/null
+++ b/virt/kvm/dirty_ring.c
@@ -0,0 +1,194 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * KVM dirty ring implementation
+ *
+ * Copyright 2019 Red Hat, Inc.
+ */
+#include <linux/kvm_host.h>
+#include <linux/kvm.h>
+#include <linux/vmalloc.h>
+#include <linux/kvm_dirty_ring.h>
+#include <trace/events/kvm.h>
+
+int __weak kvm_cpu_dirty_log_size(void)
+{
+       return 0;
+}
+
+u32 kvm_dirty_ring_get_rsvd_entries(void)
+{
+       return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
+}
+
+static u32 kvm_dirty_ring_used(struct kvm_dirty_ring *ring)
+{
+       return READ_ONCE(ring->dirty_index) - READ_ONCE(ring->reset_index);
+}
+
+bool kvm_dirty_ring_soft_full(struct kvm_dirty_ring *ring)
+{
+       return kvm_dirty_ring_used(ring) >= ring->soft_limit;
+}
+
+static bool kvm_dirty_ring_full(struct kvm_dirty_ring *ring)
+{
+       return kvm_dirty_ring_used(ring) >= ring->size;
+}
+
+struct kvm_dirty_ring *kvm_dirty_ring_get(struct kvm *kvm)
+{
+       struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+       WARN_ON_ONCE(vcpu->kvm != kvm);
+
+       return &vcpu->dirty_ring;
+}
+
+static void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
+{
+       struct kvm_memory_slot *memslot;
+       int as_id, id;
+
+       as_id = slot >> 16;
+       id = (u16)slot;
+
+       if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+               return;
+
+       memslot = id_to_memslot(__kvm_memslots(kvm, as_id), id);
+
+       if (!memslot || (offset + __fls(mask)) >= memslot->npages)
+               return;
+
+       spin_lock(&kvm->mmu_lock);
+       kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
+       spin_unlock(&kvm->mmu_lock);
+}
+
+int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
+{
+       ring->dirty_gfns = vmalloc(size);
+       if (!ring->dirty_gfns)
+               return -ENOMEM;
+       memset(ring->dirty_gfns, 0, size);
+
+       ring->size = size / sizeof(struct kvm_dirty_gfn);
+       ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
+       ring->dirty_index = 0;
+       ring->reset_index = 0;
+       ring->index = index;
+
+       return 0;
+}
+
+static inline void kvm_dirty_gfn_set_invalid(struct kvm_dirty_gfn *gfn)
+{
+       gfn->flags = 0;
+}
+
+static inline void kvm_dirty_gfn_set_dirtied(struct kvm_dirty_gfn *gfn)
+{
+       gfn->flags = KVM_DIRTY_GFN_F_DIRTY;
+}
+
+static inline bool kvm_dirty_gfn_invalid(struct kvm_dirty_gfn *gfn)
+{
+       return gfn->flags == 0;
+}
+
+static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
+{
+       return gfn->flags & KVM_DIRTY_GFN_F_RESET;
+}
+
+int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
+{
+       u32 cur_slot, next_slot;
+       u64 cur_offset, next_offset;
+       unsigned long mask;
+       int count = 0;
+       struct kvm_dirty_gfn *entry;
+       bool first_round = true;
+
+       /* This is only needed to make compilers happy */
+       cur_slot = cur_offset = mask = 0;
+
+       while (true) {
+               entry = &ring->dirty_gfns[ring->reset_index & (ring->size - 1)];
+
+               if (!kvm_dirty_gfn_harvested(entry))
+                       break;
+
+               next_slot = READ_ONCE(entry->slot);
+               next_offset = READ_ONCE(entry->offset);
+
+               /* Update the flags to reflect that this GFN is reset */
+               kvm_dirty_gfn_set_invalid(entry);
+
+               ring->reset_index++;
+               count++;
+               /*
+                * Try to coalesce the reset operations when the guest is
+                * scanning pages in the same slot.
+                */
+               if (!first_round && next_slot == cur_slot) {
+                       s64 delta = next_offset - cur_offset;
+
+                       if (delta >= 0 && delta < BITS_PER_LONG) {
+                               mask |= 1ull << delta;
+                               continue;
+                       }
+
+                       /* Backwards visit, careful about overflows!  */
+                       if (delta > -BITS_PER_LONG && delta < 0 &&
+                           (mask << -delta >> -delta) == mask) {
+                               cur_offset = next_offset;
+                               mask = (mask << -delta) | 1;
+                               continue;
+                       }
+               }
+               kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+               cur_slot = next_slot;
+               cur_offset = next_offset;
+               mask = 1;
+               first_round = false;
+       }
+
+       kvm_reset_dirty_gfn(kvm, cur_slot, cur_offset, mask);
+
+       trace_kvm_dirty_ring_reset(ring);
+
+       return count;
+}
+
+void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
+{
+       struct kvm_dirty_gfn *entry;
+
+       /* It should never get full */
+       WARN_ON_ONCE(kvm_dirty_ring_full(ring));
+
+       entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];
+
+       entry->slot = slot;
+       entry->offset = offset;
+       /*
+        * Make sure the data is filled in before we publish this to
+        * the userspace program.  There's no paired kernel-side reader.
+        */
+       smp_wmb();
+       kvm_dirty_gfn_set_dirtied(entry);
+       ring->dirty_index++;
+       trace_kvm_dirty_ring_push(ring, slot, offset);
+}
+
+struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
+{
+       return vmalloc_to_page((void *)ring->dirty_gfns + offset * PAGE_SIZE);
+}
+
+void kvm_dirty_ring_free(struct kvm_dirty_ring *ring)
+{
+       vfree(ring->dirty_gfns);
+       ring->dirty_gfns = NULL;
+}
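The coalescing done in `kvm_dirty_ring_reset()` can be easier to follow in isolation. The sketch below re-implements the same mask-merging idea over a plain array in userspace; it is not the kernel code, `struct gfn_ref`, `struct reset_batch` and `coalesce_resets()` are hypothetical names, and unlike the kernel loop it does not emit an initial empty batch before the first entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_LONG 64  /* assumed 64-bit, as on x86-64 */

struct gfn_ref { uint32_t slot; uint64_t offset; };
struct reset_batch { uint32_t slot; uint64_t offset; uint64_t mask; };

/*
 * Merge consecutive (slot, offset) pairs into batches covering a
 * window of up to 64 GFNs: bit N of mask stands for GFN offset + N.
 * Returns the number of batches written to out[].
 */
static size_t coalesce_resets(const struct gfn_ref *in, size_t n,
			      struct reset_batch *out, size_t out_cap)
{
	struct reset_batch cur = { 0, 0, 0 };
	size_t nout = 0;
	int have = 0;

	for (size_t i = 0; i < n; i++) {
		if (have && in[i].slot == cur.slot) {
			int64_t delta = (int64_t)(in[i].offset - cur.offset);

			/* Forward within the window: just set one more bit. */
			if (delta >= 0 && delta < BITS_PER_LONG) {
				cur.mask |= 1ull << delta;
				continue;
			}
			/* Backward: rebase only if no mask bits are lost. */
			if (delta > -BITS_PER_LONG && delta < 0 &&
			    (cur.mask << -delta >> -delta) == cur.mask) {
				cur.offset = in[i].offset;
				cur.mask = (cur.mask << -delta) | 1;
				continue;
			}
		}
		/* Cannot merge: flush the current batch and start a new one. */
		if (have) {
			assert(nout < out_cap);
			out[nout++] = cur;
		}
		cur.slot = in[i].slot;
		cur.offset = in[i].offset;
		cur.mask = 1;
		have = 1;
	}
	if (have) {
		assert(nout < out_cap);
		out[nout++] = cur;
	}
	return nout;
}
```

For example, offsets 10, 12 and 9 in the same slot collapse into a single batch based at offset 9 with mask `0xb`, so one call reprotects all three pages.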
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 68598fdba22689da63970350e705a5d97f1f02cc..78ef414512bffa0b053cdc21cbc9a516ddb4f93d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -63,6 +63,8 @@
  #define CREATE_TRACE_POINTS
  #include <trace/events/kvm.h>
  
+#include <linux/kvm_dirty_ring.h>
+
  /* Worst case buffer size needed for holding an integer. */
  #define ITOA_MAX_LEN 12
  
@@ -415,6 +417,7 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
  
  void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
  {
+       kvm_dirty_ring_free(&vcpu->dirty_ring);
         kvm_arch_vcpu_destroy(vcpu);
  
         /*
@@ -2644,8 +2647,13 @@ void mark_page_dirty_in_slot(struct kvm *kvm,
  {
         if (memslot && memslot->dirty_bitmap) {
                 unsigned long rel_gfn = gfn - memslot->base_gfn;
+               u32 slot = (memslot->as_id << 16) | memslot->id;
  
-               set_bit_le(rel_gfn, memslot->dirty_bitmap);
+               if (kvm->dirty_ring_size)
+                       kvm_dirty_ring_push(kvm_dirty_ring_get(kvm),
+                                           slot, rel_gfn);
+               else
+                       set_bit_le(rel_gfn, memslot->dirty_bitmap);
         }
  }
  EXPORT_SYMBOL_GPL(mark_page_dirty_in_slot);
@@ -3005,6 +3013,17 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
  }
  EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
  
+static bool kvm_page_in_dirty_ring(struct kvm *kvm, unsigned long pgoff)
+{
+#if KVM_DIRTY_LOG_PAGE_OFFSET > 0
+       return (pgoff >= KVM_DIRTY_LOG_PAGE_OFFSET) &&
+           (pgoff < KVM_DIRTY_LOG_PAGE_OFFSET +
+            kvm->dirty_ring_size / PAGE_SIZE);
+#else
+       return false;
+#endif
+}
+
  static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
  {
         struct kvm_vcpu *vcpu = vmf->vma->vm_file->private_data;
@@ -3020,6 +3039,10 @@ static vm_fault_t kvm_vcpu_fault(struct vm_fault *vmf)
         else if (vmf->pgoff == KVM_COALESCED_MMIO_PAGE_OFFSET)
                 page = virt_to_page(vcpu->kvm->coalesced_mmio_ring);
  #endif
+       else if (kvm_page_in_dirty_ring(vcpu->kvm, vmf->pgoff))
+               page = kvm_dirty_ring_get_page(
+                   &vcpu->dirty_ring,
+                   vmf->pgoff - KVM_DIRTY_LOG_PAGE_OFFSET);
         else
                 return kvm_arch_vcpu_fault(vcpu, vmf);
         get_page(page);
@@ -3033,6 +3056,14 @@ static const struct vm_operations_struct kvm_vcpu_vm_ops = {
  
  static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma)
  {
+       struct kvm_vcpu *vcpu = file->private_data;
+       unsigned long pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+
+       if ((kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff) ||
+            kvm_page_in_dirty_ring(vcpu->kvm, vma->vm_pgoff + pages - 1)) &&
+           ((vma->vm_flags & VM_EXEC) || !(vma->vm_flags & VM_SHARED)))
+               return -EINVAL;
+
         vma->vm_ops = &kvm_vcpu_vm_ops;
         return 0;
  }
@@ -3126,6 +3157,13 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
         if (r)
                 goto vcpu_free_run_page;
  
+       if (kvm->dirty_ring_size) {
+               r = kvm_dirty_ring_alloc(&vcpu->dirty_ring,
+                                        id, kvm->dirty_ring_size);
+               if (r)
+                       goto arch_vcpu_destroy;
+       }
+
         mutex_lock(&kvm->lock);
         if (kvm_get_vcpu_by_id(kvm, id)) {
                 r = -EEXIST;
@@ -3159,6 +3197,8 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
  
  unlock_vcpu_destroy:
         mutex_unlock(&kvm->lock);
+       kvm_dirty_ring_free(&vcpu->dirty_ring);
+arch_vcpu_destroy:
         kvm_arch_vcpu_destroy(vcpu);
  vcpu_free_run_page:
         free_page((unsigned long)vcpu->run);
@@ -3631,12 +3671,78 @@ static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
  #endif
         case KVM_CAP_NR_MEMSLOTS:
                 return KVM_USER_MEM_SLOTS;
+       case KVM_CAP_DIRTY_LOG_RING:
+#if KVM_DIRTY_LOG_PAGE_OFFSET > 0
+               return KVM_DIRTY_RING_MAX_ENTRIES * sizeof(struct kvm_dirty_gfn);
+#else
+               return 0;
+#endif
         default:
                 break;
         }
         return kvm_vm_ioctl_check_extension(kvm, arg);
  }
  
+static int kvm_vm_ioctl_enable_dirty_log_ring(struct kvm *kvm, u32 size)
+{
+       int r;
+
+       if (!KVM_DIRTY_LOG_PAGE_OFFSET)
+               return -EINVAL;
+
+       /* The size must be a power of 2 */
+       if (!size || (size & (size - 1)))
+               return -EINVAL;
+
+       /* Should be bigger to keep the reserved entries, or a page */
+       if (size < kvm_dirty_ring_get_rsvd_entries() *
+           sizeof(struct kvm_dirty_gfn) || size < PAGE_SIZE)
+               return -EINVAL;
+
+       if (size > KVM_DIRTY_RING_MAX_ENTRIES *
+           sizeof(struct kvm_dirty_gfn))
+               return -E2BIG;
+
+       /* We only allow it to be set once */
+       if (kvm->dirty_ring_size)
+               return -EINVAL;
+
+       mutex_lock(&kvm->lock);
+
+       if (kvm->created_vcpus) {
+               /* We don't allow changing this value after vcpu creation */
+               r = -EINVAL;
+       } else {
+               kvm->dirty_ring_size = size;
+               r = 0;
+       }
+
+       mutex_unlock(&kvm->lock);
+       return r;
+}
+
+static int kvm_vm_ioctl_reset_dirty_pages(struct kvm *kvm)
+{
+       int i;
+       struct kvm_vcpu *vcpu;
+       int cleared = 0;
+
+       if (!kvm->dirty_ring_size)
+               return -EINVAL;
+
+       mutex_lock(&kvm->slots_lock);
+
+       kvm_for_each_vcpu(i, vcpu, kvm)
+               cleared += kvm_dirty_ring_reset(vcpu->kvm, &vcpu->dirty_ring);
+
+       mutex_unlock(&kvm->slots_lock);
+
+       if (cleared)
+               kvm_flush_remote_tlbs(kvm);
+
+       return cleared;
+}
+
  int __attribute__((weak)) kvm_vm_ioctl_enable_cap(struct kvm *kvm,
                                                   struct kvm_enable_cap *cap)
  {
@@ -3667,6 +3773,8 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
                 kvm->max_halt_poll_ns = cap->args[0];
                 return 0;
         }
+       case KVM_CAP_DIRTY_LOG_RING:
+               return kvm_vm_ioctl_enable_dirty_log_ring(kvm, cap->args[0]);
         default:
                 return kvm_vm_ioctl_enable_cap(kvm, cap);
         }
@@ -3851,6 +3959,9 @@ static long kvm_vm_ioctl(struct file *filp,
         case KVM_CHECK_EXTENSION:
                 r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
                 break;
+       case KVM_RESET_DIRTY_RINGS:
+               r = kvm_vm_ioctl_reset_dirty_pages(kvm);
+               break;
         default:
                 r = kvm_arch_vm_ioctl(filp, ioctl, arg);
         }
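The size checks in `kvm_vm_ioctl_enable_dirty_log_ring()` can be mirrored in userspace to pick a valid `args[0]` before calling KVM_ENABLE_CAP. The following is a sketch under assumptions: `check_ring_size()` is a hypothetical helper, the constants are local copies of the values in this patch, and `cpu_log_entries` stands in for whatever `kvm_cpu_dirty_log_size()` returns on the host, which userspace cannot query directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DIRTY_GFN_SIZE     16u     /* sizeof(struct kvm_dirty_gfn) */
#define RING_MAX_ENTRIES   65536u  /* KVM_DIRTY_RING_MAX_ENTRIES */
#define RING_RSVD_ENTRIES  64u     /* KVM_DIRTY_RING_RSVD_ENTRIES */

/*
 * Returns 0 if `size` (in bytes) would pass the kernel's checks, or a
 * negative errno value mirroring the kernel's return codes.
 */
static int check_ring_size(uint32_t size, uint32_t page_size,
			   uint32_t cpu_log_entries)
{
	uint32_t rsvd = RING_RSVD_ENTRIES + cpu_log_entries;

	assert(page_size > 0);

	/* Must be a non-zero power of two. */
	if (!size || (size & (size - 1)))
		return -EINVAL;

	/* Must hold the reserved entries and span at least one page. */
	if (size < rsvd * DIRTY_GFN_SIZE || size < page_size)
		return -EINVAL;

	/* Must not exceed the per-ring maximum. */
	if (size > RING_MAX_ENTRIES * DIRTY_GFN_SIZE)
		return -E2BIG;

	return 0;
}
```

The upper bound does not need to be hard-coded in practice: KVM_CHECK_EXTENSION on KVM_CAP_DIRTY_LOG_RING returns the maximum ring size in bytes, as shown in `kvm_vm_ioctl_check_extension_generic()` above.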