before stopping the source VM. Enabling this migration capability will
guarantee that and thus, can potentially reduce downtime even further.
-Note that currently VFIO migration is supported only for a single device. This
-is due to VFIO migration's lack of P2P support. However, P2P support is planned
-to be added later on.
+To support migration of multiple devices that might do P2P transactions between
+themselves, VFIO migration uAPI defines an intermediate P2P quiescent state.
+While in the P2P quiescent state, P2P DMA transactions cannot be initiated by
+the device, but the device can respond to incoming ones. Additionally, all
+outstanding P2P transactions are guaranteed to have been completed by the time
+the device enters this state.
+
+All the devices that support P2P migration are first transitioned to the P2P
+quiescent state and only then are they stopped or started. This makes migration
+safe P2P-wise, since starting and stopping the devices is not done atomically
+for all the devices together.
+
+Thus, migration of multiple VFIO devices is allowed only if all the devices
+support P2P migration. Migration of a single VFIO device is allowed regardless
+of P2P migration support.
A detailed description of the UAPI for VFIO device migration can be found in
the comment for the ``vfio_device_mig_state`` structure in the header file
Flow of state changes during Live migration
===========================================
-Below is the flow of state change during live migration.
+Below is the state change flow during live migration for a VFIO device that
+supports both precopy and P2P migration. The flow for devices that don't
+support one or both of them is similar, except that the corresponding precopy
+and P2P states are skipped.
The values in the parentheses represent the VM state, the migration state, and
the VFIO device state, respectively.
-The text in the square brackets represents the flow if the VFIO device supports
-pre-copy.
Live migration save path
------------------------
::
- QEMU normal running state
- (RUNNING, _NONE, _RUNNING)
- |
+ QEMU normal running state
+ (RUNNING, _NONE, _RUNNING)
+ |
migrate_init spawns migration_thread
- Migration thread then calls each device's .save_setup()
- (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
- |
- (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
- If device is active, get pending_bytes by .state_pending_{estimate,exact}()
- If total pending_bytes >= threshold_size, call .save_live_iterate()
- [Data of VFIO device for pre-copy phase is copied]
- Iterate till total pending bytes converge and are less than threshold
- |
- On migration completion, vCPU stops and calls .save_live_complete_precopy for
- each active device. The VFIO device is then transitioned into _STOP_COPY state
- (FINISH_MIGRATE, _DEVICE, _STOP_COPY)
- |
- For the VFIO device, iterate in .save_live_complete_precopy until
- pending data is 0
- (FINISH_MIGRATE, _DEVICE, _STOP)
- |
- (FINISH_MIGRATE, _COMPLETED, _STOP)
- Migraton thread schedules cleanup bottom half and exits
+ Migration thread then calls each device's .save_setup()
+ (RUNNING, _SETUP, _PRE_COPY)
+ |
+ (RUNNING, _ACTIVE, _PRE_COPY)
+ If device is active, get pending_bytes by .state_pending_{estimate,exact}()
+ If total pending_bytes >= threshold_size, call .save_live_iterate()
+ Data of VFIO device for pre-copy phase is copied
+ Iterate till total pending bytes converge and are less than threshold
+ |
+ On migration completion, the vCPUs and the VFIO device are stopped
+ The VFIO device is first put in P2P quiescent state
+ (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
+ |
+ Then the VFIO device is put in _STOP_COPY state
+ (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
+ .save_live_complete_precopy() is called for each active device
+ For the VFIO device, iterate in .save_live_complete_precopy() until
+ pending data is 0
+ |
+ (POSTMIGRATE, _COMPLETED, _STOP_COPY)
+ Migration thread schedules cleanup bottom half and exits
+ |
+ .save_cleanup() is called
+ (POSTMIGRATE, _COMPLETED, _STOP)
Live migration resume path
--------------------------
::
- Incoming migration calls .load_setup for each device
- (RESTORE_VM, _ACTIVE, _STOP)
- |
- For each device, .load_state is called for that device section data
- (RESTORE_VM, _ACTIVE, _RESUMING)
- |
- At the end, .load_cleanup is called for each device and vCPUs are started
- (RUNNING, _NONE, _RUNNING)
+ Incoming migration calls .load_setup() for each device
+ (RESTORE_VM, _ACTIVE, _STOP)
+ |
+ For each device, .load_state() is called for that device section data
+ (RESTORE_VM, _ACTIVE, _RESUMING)
+ |
+ At the end, .load_cleanup() is called for each device and vCPUs are started
+ The VFIO device is first put in P2P quiescent state
+ (RUNNING, _ACTIVE, _RUNNING_P2P)
+ |
+ (RUNNING, _NONE, _RUNNING)
Postcopy
========
return "STOP_COPY";
case VFIO_DEVICE_STATE_RESUMING:
return "RESUMING";
+ case VFIO_DEVICE_STATE_RUNNING_P2P:
+ return "RUNNING_P2P";
case VFIO_DEVICE_STATE_PRE_COPY:
return "PRE_COPY";
+ case VFIO_DEVICE_STATE_PRE_COPY_P2P:
+ return "PRE_COPY_P2P";
default:
return "UNKNOWN STATE";
}
/* ---------------------------------------------------------------------- */
+static void vfio_vmstate_change_prepare(void *opaque, bool running,
+ RunState state)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ enum vfio_device_mig_state new_state;
+ int ret;
+
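+ /*
+ * Pick the P2P quiescent counterpart of the current device state: a
+ * device in PRE_COPY goes to PRE_COPY_P2P, any other state goes to
+ * RUNNING_P2P.
+ */
+ new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ?
+ VFIO_DEVICE_STATE_PRE_COPY_P2P :
+ VFIO_DEVICE_STATE_RUNNING_P2P;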
+
+ /*
+ * If setting the device in new_state fails, the device should be reset.
+ * To do so, use ERROR state as a recover state.
+ */
+ ret = vfio_migration_set_state(vbasedev, new_state,
+ VFIO_DEVICE_STATE_ERROR);
+ if (ret) {
+ /*
+ * Migration should be aborted in this case, but vm_state_notify()
+ * currently does not support reporting failures.
+ */
+ if (migrate_get_current()->to_dst_file) {
+ qemu_file_set_error(migrate_get_current()->to_dst_file, ret);
+ }
+ }
+
+ trace_vfio_vmstate_change_prepare(vbasedev->name, running,
+ RunState_str(state),
+ mig_state_to_str(new_state));
+}
+
static void vfio_vmstate_change(void *opaque, bool running, RunState state)
{
VFIODevice *vbasedev = opaque;
char id[256] = "";
g_autofree char *path = NULL, *oid = NULL;
uint64_t mig_flags = 0;
+ VMChangeStateHandler *prepare_cb;
if (!vbasedev->ops->vfio_get_object) {
return -EINVAL;
register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
vbasedev);
- migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev,
- vfio_vmstate_change,
- vbasedev);
+ prepare_cb = migration->mig_flags & VFIO_MIGRATION_P2P ?
+ vfio_vmstate_change_prepare :
+ NULL;
+ migration->vm_state = qdev_add_vm_change_state_handler_full(
+ vbasedev->dev, vfio_vmstate_change, prepare_cb, vbasedev);
migration->migration_state.notify = vfio_migration_state_notifier;
add_migration_state_change_notifier(&migration->migration_state);