From f7ebd9ed7767e01db75fe313b78cd3882ab655e3 Mon Sep 17 00:00:00 2001 From: Ralph Campbell Date: Wed, 9 Sep 2020 14:29:56 -0700 Subject: [PATCH] mm/doc: add usage description for migrate_vma_*() The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize() API usage by device drivers is not well documented. Add a description for how device drivers are expected to use it. Signed-off-by: Ralph Campbell Reviewed-by: Alistair Popple Link: https://lore.kernel.org/r/20200909212956.20104-1-rcampbell@nvidia.com Signed-off-by: Jonathan Corbet --- Documentation/vm/hmm.rst | 137 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 133 insertions(+), 4 deletions(-) diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst index 7453acc1ea4f3..dd9f76a4ef29a 100644 --- a/Documentation/vm/hmm.rst +++ b/Documentation/vm/hmm.rst @@ -271,10 +271,139 @@ map those pages from the CPU side. Migration to and from device memory =================================== -Because the CPU cannot access device memory, migration must use the device DMA -engine to perform copy from and to device memory. For this we need to use -migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize() helpers. - +Because the CPU cannot access device memory directly, the device driver must +use hardware DMA or device specific load/store instructions to migrate data. +The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize() +functions are designed to make drivers easier to write and to centralize common +code across drivers. + +Before migrating pages to device private memory, special device private +``struct page`` need to be created. These will be used as special "swap" +page table entries so that a CPU process will fault if it tries to access +a page that has been migrated to device private memory. + +These can be allocated and freed with:: + + struct resource *res; + struct dev_pagemap pagemap; + + res = request_free_mem_region(&iomem_resource, /* number of bytes */, + "name of driver resource"); + pagemap.type = MEMORY_DEVICE_PRIVATE; + pagemap.range.start = res->start; + pagemap.range.end = res->end; + pagemap.nr_range = 1; + pagemap.ops = &device_devmem_ops; + memremap_pages(&pagemap, numa_node_id()); + + memunmap_pages(&pagemap); + release_mem_region(pagemap.range.start, range_len(&pagemap.range)); + +There are also devm_request_free_mem_region(), devm_memremap_pages(), +devm_memunmap_pages(), and devm_release_mem_region() when the resources can +be tied to a ``struct device``. + +The overall migration steps are similar to migrating NUMA pages within system +memory (see :ref:`Page migration `) but the steps are split +between device driver specific code and shared common code: + +1. ``mmap_read_lock()`` + + The device driver has to pass a ``struct vm_area_struct`` to + migrate_vma_setup() so the mmap_read_lock() or mmap_write_lock() needs to + be held for the duration of the migration. + +2. ``migrate_vma_setup(struct migrate_vma *args)`` + + The device driver initializes the ``struct migrate_vma`` fields and passes + the pointer to migrate_vma_setup(). The ``args->flags`` field is used to + filter which source pages should be migrated. For example, setting + ``MIGRATE_VMA_SELECT_SYSTEM`` will only migrate system memory and + ``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` will only migrate pages residing in + device private memory. If the latter flag is set, the ``args->pgmap_owner`` + field is used to identify device private pages owned by the driver. This + avoids trying to migrate device private pages residing in other devices. + Currently only anonymous private VMA ranges can be migrated to or from + system memory and device private memory. + + One of the first steps migrate_vma_setup() does is to invalidate other + device's MMUs with the ``mmu_notifier_invalidate_range_start(()`` and + ``mmu_notifier_invalidate_range_end()`` calls around the page table + walks to fill in the ``args->src`` array with PFNs to be migrated. + The ``invalidate_range_start()`` callback is passed a + ``struct mmu_notifier_range`` with the ``event`` field set to + ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to + the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is + allows the device driver to skip the invalidation callback and only + invalidate device private MMU mappings that are actually migrating. + This is explained more in the next section. + + While walking the page tables, a ``pte_none()`` or ``is_zero_pfn()`` + entry results in a valid "zero" PFN stored in the ``args->src`` array. + This lets the driver allocate device private memory and clear it instead + of copying a page of zeros. Valid PTE entries to system memory or + device private struct pages will be locked with ``lock_page()``, isolated + from the LRU (if system memory since device private pages are not on + the LRU), unmapped from the process, and a special migration PTE is + inserted in place of the original PTE. + migrate_vma_setup() also clears the ``args->dst`` array. + +3. The device driver allocates destination pages and copies source pages to + destination pages. + + The driver checks each ``src`` entry to see if the ``MIGRATE_PFN_MIGRATE`` + bit is set and skips entries that are not migrating. The device driver + can also choose to skip migrating a page by not filling in the ``dst`` + array for that page. + + The driver then allocates either a device private struct page or a + system memory page, locks the page with ``lock_page()``, and fills in the + ``dst`` array entry with:: + + dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; + + Now that the driver knows that this page is being migrated, it can + invalidate device private MMU mappings and copy device private memory + to system memory or another device private page. The core Linux kernel + handles CPU page table invalidations so the device driver only has to + invalidate its own MMU mappings. + + The driver can use ``migrate_pfn_to_page(src[i])`` to get the + ``struct page`` of the source and either copy the source page to the + destination or clear the destination device private memory if the pointer + is ``NULL`` meaning the source page was not populated in system memory. + +4. ``migrate_vma_pages()`` + + This step is where the migration is actually "committed". + + If the source page was a ``pte_none()`` or ``is_zero_pfn()`` page, this + is where the newly allocated page is inserted into the CPU's page table. + This can fail if a CPU thread faults on the same page. However, the page + table is locked and only one of the new pages will be inserted. + The device driver will see that the ``MIGRATE_PFN_MIGRATE`` bit is cleared + if it loses the race. + + If the source page was locked, isolated, etc. the source ``struct page`` + information is now copied to destination ``struct page`` finalizing the + migration on the CPU side. + +5. Device driver updates device MMU page tables for pages still migrating, + rolling back pages not migrating. + + If the ``src`` entry still has ``MIGRATE_PFN_MIGRATE`` bit set, the device + driver can update the device MMU and set the write enable bit if the + ``MIGRATE_PFN_WRITE`` bit is set. + +6. ``migrate_vma_finalize()`` + + This step replaces the special migration page table entry with the new + page's page table entry and releases the reference to the source and + destination ``struct page``. + +7. ``mmap_read_unlock()`` + + The lock can now be released. Memory cgroup (memcg) and rss accounting ======================================== -- 2.51.0