From 1f5538a8e2fd3602cd507897c16ee647360a0423 Mon Sep 17 00:00:00 2001 From: Roman Kisel Date: Mon, 28 Apr 2025 14:07:36 -0700 Subject: [PATCH 01/16] arm64: hyperv: Initialize the Virtual Trust Level field Various parts of the hyperv code need to know what VTL the kernel runs at, most notably VMBus needs that to establish communication with the host. Initialize the Virtual Trust Level field to enable booting in the Virtual Trust Level. Signed-off-by: Roman Kisel Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/20250428210742.435282-6-romank@linux.microsoft.com Signed-off-by: Wei Liu Message-ID: <20250428210742.435282-6-romank@linux.microsoft.com> --- arch/arm64/hyperv/mshyperv.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c index 21458b6338aa..43f422a7ef34 100644 --- a/arch/arm64/hyperv/mshyperv.c +++ b/arch/arm64/hyperv/mshyperv.c @@ -117,6 +117,7 @@ static int __init hyperv_init(void) if (ms_hyperv.priv_high & HV_ACCESS_PARTITION_ID) hv_get_partition_id(); + ms_hyperv.vtl = get_vtl(); ms_hyperv_late_init(); -- 2.51.0 From e956ee9491d90c4107ca6e7c4f2210d806c89f9b Mon Sep 17 00:00:00 2001 From: Roman Kisel Date: Mon, 28 Apr 2025 14:07:37 -0700 Subject: [PATCH 02/16] arm64, x86: hyperv: Report the VTL the system boots in The hyperv guest code might run in various Virtual Trust Levels. Report the level when the kernel boots in the non-default (0) one. Signed-off-by: Roman Kisel Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/20250428210742.435282-7-romank@linux.microsoft.com Signed-off-by: Wei Liu Message-ID: <20250428210742.435282-7-romank@linux.microsoft.com> --- arch/arm64/hyperv/mshyperv.c | 2 ++ arch/x86/hyperv/hv_vtl.c | 7 ++++++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c index 43f422a7ef34..4fdc26ade1d7 100644 --- a/arch/arm64/hyperv/mshyperv.c +++ b/arch/arm64/hyperv/mshyperv.c @@ -118,6 +118,8 @@ static int __init hyperv_init(void) if (ms_hyperv.priv_high & HV_ACCESS_PARTITION_ID) hv_get_partition_id(); ms_hyperv.vtl = get_vtl(); + if (ms_hyperv.vtl > 0) /* non default VTL */ + pr_info("Linux runs in Hyper-V Virtual Trust Level %d\n", ms_hyperv.vtl); ms_hyperv_late_init(); diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c index 13242ed8ff16..ac3d27a766d5 100644 --- a/arch/x86/hyperv/hv_vtl.c +++ b/arch/x86/hyperv/hv_vtl.c @@ -55,7 +55,12 @@ static void __noreturn hv_vtl_restart(char __maybe_unused *cmd) void __init hv_vtl_init_platform(void) { - pr_info("Linux runs in Hyper-V Virtual Trust Level\n"); + /* + * This function is a no-op if the VTL mode is not enabled. + * If it is, this function runs if and only the kernel boots in + * VTL2 which the x86 hv initialization path makes sure of. + */ + pr_info("Linux runs in Hyper-V Virtual Trust Level %d\n", ms_hyperv.vtl); x86_platform.realmode_reserve = x86_init_noop; x86_platform.realmode_init = x86_init_noop; -- 2.51.0 From 23aa0c355921a39111f1646ef6592da20af9394c Mon Sep 17 00:00:00 2001 From: Roman Kisel Date: Mon, 28 Apr 2025 14:07:38 -0700 Subject: [PATCH 03/16] dt-bindings: microsoft,vmbus: Add interrupt and DMA coherence properties To boot in the VTL mode, VMBus on arm64 needs interrupt description which the binding documentation lacks. The transactions on the bus are DMA coherent which is not mentioned as well. Add the interrupt property and the DMA coherence property to the VMBus binding. Update the example to match that. Fix typos. Signed-off-by: Roman Kisel Reviewed-by: Krzysztof Kozlowski Link: https://lore.kernel.org/r/20250428210742.435282-8-romank@linux.microsoft.com Signed-off-by: Wei Liu Message-ID: <20250428210742.435282-8-romank@linux.microsoft.com> --- .../devicetree/bindings/bus/microsoft,vmbus.yaml | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml b/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml index a8d40c766dcd..0bea4f5287ce 100644 --- a/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml +++ b/Documentation/devicetree/bindings/bus/microsoft,vmbus.yaml @@ -10,8 +10,8 @@ maintainers: - Saurabh Sengar description: - VMBus is a software bus that implement the protocols for communication - between the root or host OS and guest OSs (virtual machines). + VMBus is a software bus that implements the protocols for communication + between the root or host OS and guest OS'es (virtual machines). properties: compatible: @@ -25,9 +25,16 @@ properties: '#size-cells': const: 1 + dma-coherent: true + + interrupts: + maxItems: 1 + description: Interrupt is used to report a message from the host. + required: - compatible - ranges + - interrupts - '#address-cells' - '#size-cells' @@ -35,6 +42,8 @@ additionalProperties: false examples: - | + #include + #include soc { #address-cells = <2>; #size-cells = <1>; @@ -49,6 +58,9 @@ examples: #address-cells = <2>; #size-cells = <1>; ranges = <0x0f 0xf0000000 0x0f 0xf0000000 0x10000000>; + dma-coherent; + interrupt-parent = <&gic>; + interrupts = ; }; }; }; -- 2.51.0 From 1dc5df133b98eca75d079e4485ade6b601cadf59 Mon Sep 17 00:00:00 2001 From: Roman Kisel Date: Mon, 28 Apr 2025 14:07:39 -0700 Subject: [PATCH 04/16] Drivers: hv: vmbus: Get the IRQ number from DeviceTree The VMBus driver uses ACPI for interrupt assignment on arm64 hence it won't function in the VTL mode where only DeviceTree can be used. Update the VMBus driver to discover interrupt configuration from DT. Signed-off-by: Roman Kisel Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/20250428210742.435282-9-romank@linux.microsoft.com Signed-off-by: Wei Liu Message-ID: <20250428210742.435282-9-romank@linux.microsoft.com> --- drivers/hv/vmbus_drv.c | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index e3d51a316316..d24fdb3d8f14 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -2465,6 +2465,31 @@ static int vmbus_acpi_add(struct platform_device *pdev) } #endif +static int vmbus_set_irq(struct platform_device *pdev) +{ + struct irq_data *data; + int irq; + irq_hw_number_t hwirq; + + irq = platform_get_irq(pdev, 0); + /* platform_get_irq() may not return 0. */ + if (irq < 0) + return irq; + + data = irq_get_irq_data(irq); + if (!data) { + pr_err("No interrupt data for VMBus virq %d\n", irq); + return -ENODEV; + } + hwirq = irqd_to_hwirq(data); + + vmbus_irq = irq; + vmbus_interrupt = hwirq; + pr_debug("VMBus virq %d, hwirq %d\n", vmbus_irq, vmbus_interrupt); + + return 0; +} + static int vmbus_device_add(struct platform_device *pdev) { struct resource **cur_res = &hyperv_mmio; @@ -2479,6 +2504,11 @@ static int vmbus_device_add(struct platform_device *pdev) if (ret) return ret; + if (!__is_defined(HYPERVISOR_CALLBACK_VECTOR)) + ret = vmbus_set_irq(pdev); + if (ret) + return ret; + for_each_of_range(&parser, &range) { struct resource *res; -- 2.51.0 From 18a34bb5221e2b79dbcba5bb50d92beb45b68e15 Mon Sep 17 00:00:00 2001 From: Roman Kisel Date: Mon, 28 Apr 2025 14:07:40 -0700 Subject: [PATCH 05/16] Drivers: hv: vmbus: Introduce hv_get_vmbus_root_device() The ARM64 PCI code for hyperv needs to know the VMBus root device, and it is private. Provide a function that returns it. Rename it from "hv_dev" as "hv_dev" as a symbol is very overloaded. No functional changes. Signed-off-by: Roman Kisel Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/20250428210742.435282-10-romank@linux.microsoft.com Signed-off-by: Wei Liu Message-ID: <20250428210742.435282-10-romank@linux.microsoft.com> --- drivers/hv/vmbus_drv.c | 23 +++++++++++++++-------- include/linux/hyperv.h | 2 ++ 2 files changed, 17 insertions(+), 8 deletions(-) diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c index d24fdb3d8f14..d3321700ded6 100644 --- a/drivers/hv/vmbus_drv.c +++ b/drivers/hv/vmbus_drv.c @@ -45,7 +45,8 @@ struct vmbus_dynid { struct hv_vmbus_device_id id; }; -static struct device *hv_dev; +/* VMBus Root Device */ +static struct device *vmbus_root_device; static int hyperv_cpuhp_online; @@ -80,9 +81,15 @@ static struct resource *fb_mmio; static struct resource *hyperv_mmio; static DEFINE_MUTEX(hyperv_mmio_lock); +struct device *hv_get_vmbus_root_device(void) +{ + return vmbus_root_device; +} +EXPORT_SYMBOL_GPL(hv_get_vmbus_root_device); + static int vmbus_exists(void) { - if (hv_dev == NULL) + if (vmbus_root_device == NULL) return -ENODEV; return 0; @@ -861,7 +868,7 @@ static int vmbus_dma_configure(struct device *child_device) * On x86/x64 coherence is assumed and these calls have no effect. */ hv_setup_dma_ops(child_device, - device_get_dma_attr(hv_dev) == DEV_DMA_COHERENT); + device_get_dma_attr(vmbus_root_device) == DEV_DMA_COHERENT); return 0; } @@ -2037,7 +2044,7 @@ int vmbus_device_register(struct hv_device *child_device_obj) &child_device_obj->channel->offermsg.offer.if_instance); child_device_obj->device.bus = &hv_bus; - child_device_obj->device.parent = hv_dev; + child_device_obj->device.parent = vmbus_root_device; child_device_obj->device.release = vmbus_device_release; child_device_obj->device.dma_parms = &child_device_obj->dma_parms; @@ -2412,7 +2419,7 @@ static int vmbus_acpi_add(struct platform_device *pdev) struct acpi_device *ancestor; struct acpi_device *device = ACPI_COMPANION(&pdev->dev); - hv_dev = &device->dev; + vmbus_root_device = &device->dev; /* * Older versions of Hyper-V for ARM64 fail to include the _CCA @@ -2498,7 +2505,7 @@ static int vmbus_device_add(struct platform_device *pdev) struct device_node *np = pdev->dev.of_node; int ret; - hv_dev = &pdev->dev; + vmbus_root_device = &pdev->dev; ret = of_range_parser_init(&parser, np); if (ret) @@ -2816,7 +2823,7 @@ static int __init hv_acpi_init(void) if (ret) return ret; - if (!hv_dev) { + if (!vmbus_root_device) { ret = -ENODEV; goto cleanup; } @@ -2847,7 +2854,7 @@ static int __init hv_acpi_init(void) cleanup: platform_driver_unregister(&vmbus_platform_driver); - hv_dev = NULL; + vmbus_root_device = NULL; return ret; } diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h index d6ffe01962c2..23b3c3a2ed8c 100644 --- a/include/linux/hyperv.h +++ b/include/linux/hyperv.h @@ -1283,6 +1283,8 @@ static inline void *hv_get_drvdata(struct hv_device *dev) return dev_get_drvdata(&dev->device); } +struct device *hv_get_vmbus_root_device(void); + struct hv_ring_buffer_debug_info { u32 current_interrupt_mask; u32 current_read_index; -- 2.51.0 From ab7e531a8212394608838c31e9dfac69fafa6c8a Mon Sep 17 00:00:00 2001 From: Roman Kisel Date: Mon, 28 Apr 2025 14:07:41 -0700 Subject: [PATCH 06/16] ACPI: irq: Introduce acpi_get_gsi_dispatcher() Using acpi_irq_create_hierarchy() in the cases where the code also handles OF leads to code duplication as the ACPI subsystem doesn't provide means to compute the IRQ domain parent whereas the OF does. Introduce acpi_get_gsi_dispatcher() so that the drivers relying on both ACPI and OF may use irq_domain_create_hierarchy() in the common code paths. No functional changes. Signed-off-by: Roman Kisel Reviewed-by: Michael Kelley Acked-by: Rafael J. Wysocki Link: https://lore.kernel.org/r/20250428210742.435282-11-romank@linux.microsoft.com Signed-off-by: Wei Liu Message-ID: <20250428210742.435282-11-romank@linux.microsoft.com> --- drivers/acpi/irq.c | 16 ++++++++++++++-- include/linux/acpi.h | 5 ++++- 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/acpi/irq.c b/drivers/acpi/irq.c index 1687483ff319..76a856c32c4d 100644 --- a/drivers/acpi/irq.c +++ b/drivers/acpi/irq.c @@ -12,7 +12,7 @@ enum acpi_irq_model_id acpi_irq_model; -static struct fwnode_handle *(*acpi_get_gsi_domain_id)(u32 gsi); +static acpi_gsi_domain_disp_fn acpi_get_gsi_domain_id; static u32 (*acpi_gsi_to_irq_fallback)(u32 gsi); /** @@ -307,12 +307,24 @@ EXPORT_SYMBOL_GPL(acpi_irq_get); * for a given GSI */ void __init acpi_set_irq_model(enum acpi_irq_model_id model, - struct fwnode_handle *(*fn)(u32)) + acpi_gsi_domain_disp_fn fn) { acpi_irq_model = model; acpi_get_gsi_domain_id = fn; } +/* + * acpi_get_gsi_dispatcher() - Get the GSI dispatcher function + * + * Return the dispatcher function that computes the domain fwnode for + * a given GSI. + */ +acpi_gsi_domain_disp_fn acpi_get_gsi_dispatcher(void) +{ + return acpi_get_gsi_domain_id; +} +EXPORT_SYMBOL_GPL(acpi_get_gsi_dispatcher); + /** * acpi_set_gsi_to_irq_fallback - Register a GSI transfer * callback to fallback to arch specified implementation. diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 3f2e93ed9730..14e8871b5787 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -335,8 +335,11 @@ int acpi_register_gsi (struct device *dev, u32 gsi, int triggering, int polarity int acpi_gsi_to_irq (u32 gsi, unsigned int *irq); int acpi_isa_irq_to_gsi (unsigned isa_irq, u32 *gsi); +typedef struct fwnode_handle *(*acpi_gsi_domain_disp_fn)(u32); + void acpi_set_irq_model(enum acpi_irq_model_id model, - struct fwnode_handle *(*)(u32)); + acpi_gsi_domain_disp_fn fn); +acpi_gsi_domain_disp_fn acpi_get_gsi_dispatcher(void); void acpi_set_gsi_to_irq_fallback(u32 (*)(u32)); struct irq_domain *acpi_irq_create_hierarchy(unsigned int flags, -- 2.51.0 From d684f9b28809b783e8473727fdf14595b36d8fd3 Mon Sep 17 00:00:00 2001 From: Roman Kisel Date: Mon, 28 Apr 2025 14:07:42 -0700 Subject: [PATCH 07/16] PCI: hv: Get vPCI MSI IRQ domain from DeviceTree The hyperv-pci driver uses ACPI for MSI IRQ domain configuration on arm64. It won't be able to do that in the VTL mode where only DeviceTree can be used. Update the hyperv-pci driver to get vPCI MSI IRQ domain in the DeviceTree case, too. Signed-off-by: Roman Kisel Acked-by: Bjorn Helgaas Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/20250428210742.435282-12-romank@linux.microsoft.com Signed-off-by: Wei Liu Message-ID: <20250428210742.435282-12-romank@linux.microsoft.com> --- drivers/pci/controller/pci-hyperv.c | 70 ++++++++++++++++++++++++++--- 1 file changed, 64 insertions(+), 6 deletions(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index ac27bda5ba26..682f1ca2d684 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -50,6 +50,7 @@ #include #include #include +#include #include /* @@ -817,9 +818,17 @@ static int hv_pci_vec_irq_gic_domain_alloc(struct irq_domain *domain, int ret; fwspec.fwnode = domain->parent->fwnode; - fwspec.param_count = 2; - fwspec.param[0] = hwirq; - fwspec.param[1] = IRQ_TYPE_EDGE_RISING; + if (is_of_node(fwspec.fwnode)) { + /* SPI lines for OF translations start at offset 32 */ + fwspec.param_count = 3; + fwspec.param[0] = 0; + fwspec.param[1] = hwirq - 32; + fwspec.param[2] = IRQ_TYPE_EDGE_RISING; + } else { + fwspec.param_count = 2; + fwspec.param[0] = hwirq; + fwspec.param[1] = IRQ_TYPE_EDGE_RISING; + } ret = irq_domain_alloc_irqs_parent(domain, virq, 1, &fwspec); if (ret) @@ -887,10 +896,44 @@ static const struct irq_domain_ops hv_pci_domain_ops = { .activate = hv_pci_vec_irq_domain_activate, }; +#ifdef CONFIG_OF + +static struct irq_domain *hv_pci_of_irq_domain_parent(void) +{ + struct device_node *parent; + struct irq_domain *domain; + + parent = of_irq_find_parent(hv_get_vmbus_root_device()->of_node); + if (!parent) + return NULL; + domain = irq_find_host(parent); + of_node_put(parent); + + return domain; +} + +#endif + +#ifdef CONFIG_ACPI + +static struct irq_domain *hv_pci_acpi_irq_domain_parent(void) +{ + acpi_gsi_domain_disp_fn gsi_domain_disp_fn; + + gsi_domain_disp_fn = acpi_get_gsi_dispatcher(); + if (!gsi_domain_disp_fn) + return NULL; + return irq_find_matching_fwnode(gsi_domain_disp_fn(0), + DOMAIN_BUS_ANY); +} + +#endif + static int hv_pci_irqchip_init(void) { static struct hv_pci_chip_data *chip_data; struct fwnode_handle *fn = NULL; + struct irq_domain *irq_domain_parent = NULL; int ret = -ENOMEM; chip_data = kzalloc(sizeof(*chip_data), GFP_KERNEL); @@ -907,9 +950,24 @@ static int hv_pci_irqchip_init(void) * way to ensure that all the corresponding devices are also gone and * no interrupts will be generated. */ - hv_msi_gic_irq_domain = acpi_irq_create_hierarchy(0, HV_PCI_MSI_SPI_NR, - fn, &hv_pci_domain_ops, - chip_data); +#ifdef CONFIG_ACPI + if (!acpi_disabled) + irq_domain_parent = hv_pci_acpi_irq_domain_parent(); +#endif +#ifdef CONFIG_OF + if (!irq_domain_parent) + irq_domain_parent = hv_pci_of_irq_domain_parent(); +#endif + if (!irq_domain_parent) { + WARN_ONCE(1, "Invalid firmware configuration for VMBus interrupts\n"); + ret = -EINVAL; + goto free_chip; + } + + hv_msi_gic_irq_domain = irq_domain_create_hierarchy(irq_domain_parent, 0, + HV_PCI_MSI_SPI_NR, + fn, &hv_pci_domain_ops, + chip_data); if (!hv_msi_gic_irq_domain) { pr_err("Failed to create Hyper-V arm64 vPCI MSI IRQ domain\n"); -- 2.51.0 From 86c48271e0d60c82665e9fd61277002391efcef7 Mon Sep 17 00:00:00 2001 From: Roman Kisel Date: Wed, 7 May 2025 11:22:25 -0700 Subject: [PATCH 08/16] x86/hyperv: Fix APIC ID and VP index confusion in hv_snp_boot_ap() To start an application processor in SNP-isolated guest, a hypercall is used that takes a virtual processor index. The hv_snp_boot_ap() function uses that START_VP hypercall but passes as VP index to it what it receives as a wakeup_secondary_cpu_64 callback: the APIC ID. As those two aren't generally interchangeable, that may lead to hung APs if the VP index and the APIC ID don't match up. Update the parameter names to avoid confusion as to what the parameter is. Use the APIC ID to the VP index conversion to provide the correct input to the hypercall. Cc: stable@vger.kernel.org Fixes: 44676bb9d566 ("x86/hyperv: Add smp support for SEV-SNP guest") Signed-off-by: Roman Kisel Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/20250507182227.7421-2-romank@linux.microsoft.com Signed-off-by: Wei Liu Message-ID: <20250507182227.7421-2-romank@linux.microsoft.com> --- arch/x86/hyperv/hv_init.c | 33 +++++++++++++++++++++++++ arch/x86/hyperv/hv_vtl.c | 44 +++++---------------------------- arch/x86/hyperv/ivm.c | 22 +++++++++++++++-- arch/x86/include/asm/mshyperv.h | 6 +++-- include/hyperv/hvgdk_mini.h | 2 +- 5 files changed, 64 insertions(+), 43 deletions(-) diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c index 3b569291dfed..9a8fc144e195 100644 --- a/arch/x86/hyperv/hv_init.c +++ b/arch/x86/hyperv/hv_init.c @@ -672,3 +672,36 @@ bool hv_is_hyperv_initialized(void) return hypercall_msr.enable; } EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized); + +int hv_apicid_to_vp_index(u32 apic_id) +{ + u64 control; + u64 status; + unsigned long irq_flags; + struct hv_get_vp_from_apic_id_in *input; + u32 *output, ret; + + local_irq_save(irq_flags); + + input = *this_cpu_ptr(hyperv_pcpu_input_arg); + memset(input, 0, sizeof(*input)); + input->partition_id = HV_PARTITION_ID_SELF; + input->apic_ids[0] = apic_id; + + output = *this_cpu_ptr(hyperv_pcpu_output_arg); + + control = HV_HYPERCALL_REP_COMP_1 | HVCALL_GET_VP_INDEX_FROM_APIC_ID; + status = hv_do_hypercall(control, input, output); + ret = output[0]; + + local_irq_restore(irq_flags); + + if (!hv_result_success(status)) { + pr_err("failed to get vp index from apic id %d, status %#llx\n", + apic_id, status); + return -EINVAL; + } + + return ret; +} +EXPORT_SYMBOL_GPL(hv_apicid_to_vp_index); diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c index ac3d27a766d5..2f32ac1ae40e 100644 --- a/arch/x86/hyperv/hv_vtl.c +++ b/arch/x86/hyperv/hv_vtl.c @@ -211,41 +211,9 @@ free_lock: return ret; } -static int hv_vtl_apicid_to_vp_id(u32 apic_id) -{ - u64 control; - u64 status; - unsigned long irq_flags; - struct hv_get_vp_from_apic_id_in *input; - u32 *output, ret; - - local_irq_save(irq_flags); - - input = *this_cpu_ptr(hyperv_pcpu_input_arg); - memset(input, 0, sizeof(*input)); - input->partition_id = HV_PARTITION_ID_SELF; - input->apic_ids[0] = apic_id; - - output = *this_cpu_ptr(hyperv_pcpu_output_arg); - - control = HV_HYPERCALL_REP_COMP_1 | HVCALL_GET_VP_ID_FROM_APIC_ID; - status = hv_do_hypercall(control, input, output); - ret = output[0]; - - local_irq_restore(irq_flags); - - if (!hv_result_success(status)) { - pr_err("failed to get vp id from apic id %d, status %#llx\n", - apic_id, status); - return -EINVAL; - } - - return ret; -} - static int hv_vtl_wakeup_secondary_cpu(u32 apicid, unsigned long start_eip) { - int vp_id, cpu; + int vp_index, cpu; /* Find the logical CPU for the APIC ID */ for_each_present_cpu(cpu) { @@ -256,18 +224,18 @@ static int hv_vtl_wakeup_secondary_cpu(u32 apicid, unsigned long start_eip) return -EINVAL; pr_debug("Bringing up CPU with APIC ID %d in VTL2...\n", apicid); - vp_id = hv_vtl_apicid_to_vp_id(apicid); + vp_index = hv_apicid_to_vp_index(apicid); - if (vp_id < 0) { + if (vp_index < 0) { pr_err("Couldn't find CPU with APIC ID %d\n", apicid); return -EINVAL; } - if (vp_id > ms_hyperv.max_vp_index) { - pr_err("Invalid CPU id %d for APIC ID %d\n", vp_id, apicid); + if (vp_index > ms_hyperv.max_vp_index) { + pr_err("Invalid CPU id %d for APIC ID %d\n", vp_index, apicid); return -EINVAL; } - return hv_vtl_bringup_vcpu(vp_id, cpu, start_eip); + return hv_vtl_bringup_vcpu(vp_index, cpu, start_eip); } int __init hv_vtl_early_init(void) diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c index 77bf05f06b9e..0cc239cdb4da 100644 --- a/arch/x86/hyperv/ivm.c +++ b/arch/x86/hyperv/ivm.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -288,7 +289,7 @@ static void snp_cleanup_vmsa(struct sev_es_save_area *vmsa) free_page((unsigned long)vmsa); } -int hv_snp_boot_ap(u32 cpu, unsigned long start_ip) +int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip) { struct sev_es_save_area *vmsa = (struct sev_es_save_area *) __get_free_page(GFP_KERNEL | __GFP_ZERO); @@ -297,10 +298,27 @@ int hv_snp_boot_ap(u32 cpu, unsigned long start_ip) u64 ret, retry = 5; struct hv_enable_vp_vtl *start_vp_input; unsigned long flags; + int cpu, vp_index; if (!vmsa) return -ENOMEM; + /* Find the Hyper-V VP index which might be not the same as APIC ID */ + vp_index = hv_apicid_to_vp_index(apic_id); + if (vp_index < 0 || vp_index > ms_hyperv.max_vp_index) + return -EINVAL; + + /* + * Find the Linux CPU number for addressing the per-CPU data, and it + * might not be the same as APIC ID. + */ + for_each_present_cpu(cpu) { + if (arch_match_cpu_phys_id(cpu, apic_id)) + break; + } + if (cpu >= nr_cpu_ids) + return -EINVAL; + native_store_gdt(&gdtr); vmsa->gdtr.base = gdtr.address; @@ -348,7 +366,7 @@ int hv_snp_boot_ap(u32 cpu, unsigned long start_ip) start_vp_input = (struct hv_enable_vp_vtl *)ap_start_input_arg; memset(start_vp_input, 0, sizeof(*start_vp_input)); start_vp_input->partition_id = -1; - start_vp_input->vp_index = cpu; + start_vp_input->vp_index = vp_index; start_vp_input->target_vtl.target_vtl = ms_hyperv.vtl; *(u64 *)&start_vp_input->vp_context = __pa(vmsa) | 1; diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index bab5ccfc60a7..0b9a3a307d06 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -268,11 +268,11 @@ int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry); #ifdef CONFIG_AMD_MEM_ENCRYPT bool hv_ghcb_negotiate_protocol(void); void __noreturn hv_ghcb_terminate(unsigned int set, unsigned int reason); -int hv_snp_boot_ap(u32 cpu, unsigned long start_ip); +int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip); #else static inline bool hv_ghcb_negotiate_protocol(void) { return false; } static inline void hv_ghcb_terminate(unsigned int set, unsigned int reason) {} -static inline int hv_snp_boot_ap(u32 cpu, unsigned long start_ip) { return 0; } +static inline int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip) { return 0; } #endif #if defined(CONFIG_AMD_MEM_ENCRYPT) || defined(CONFIG_INTEL_TDX_GUEST) @@ -306,6 +306,7 @@ static __always_inline u64 hv_raw_get_msr(unsigned int reg) { return __rdmsr(reg); } +int hv_apicid_to_vp_index(u32 apic_id); #else /* CONFIG_HYPERV */ static inline void hyperv_init(void) {} @@ -327,6 +328,7 @@ static inline void hv_set_msr(unsigned int reg, u64 value) { } static inline u64 hv_get_msr(unsigned int reg) { return 0; } static inline void hv_set_non_nested_msr(unsigned int reg, u64 value) { } static inline u64 hv_get_non_nested_msr(unsigned int reg) { return 0; } +static inline int hv_apicid_to_vp_index(u32 apic_id) { return -EINVAL; } #endif /* CONFIG_HYPERV */ diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h index cf0923dc727d..2d431b53f587 100644 --- a/include/hyperv/hvgdk_mini.h +++ b/include/hyperv/hvgdk_mini.h @@ -475,7 +475,7 @@ union hv_vp_assist_msr_contents { /* HV_REGISTER_VP_ASSIST_PAGE */ #define HVCALL_CREATE_PORT 0x0095 #define HVCALL_CONNECT_PORT 0x0096 #define HVCALL_START_VP 0x0099 -#define HVCALL_GET_VP_ID_FROM_APIC_ID 0x009a +#define HVCALL_GET_VP_INDEX_FROM_APIC_ID 0x009a #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE 0x00af #define HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_LIST 0x00b0 #define HVCALL_SIGNAL_EVENT_DIRECT 0x00c0 -- 2.51.0 From 43cb39ad260a168abf6da3a48f394afff92d436b Mon Sep 17 00:00:00 2001 From: Roman Kisel Date: Wed, 7 May 2025 11:22:26 -0700 Subject: [PATCH 09/16] arch/x86: Provide the CPU number in the wakeup AP callback When starting APs, confidential guests and paravisor guests need to know the CPU number, and the pattern of using the linear search has emerged in several places. With N processors that leads to the O(N^2) time complexity. Provide the CPU number in the AP wake up callback so that one can get the CPU number in constant time. Suggested-by: Michael Kelley Signed-off-by: Roman Kisel Reviewed-by: Tom Lendacky Reviewed-by: Thomas Gleixner Reviewed-by: Michael Kelley Acked-by: Rafael J. Wysocki Link: https://lore.kernel.org/r/20250507182227.7421-3-romank@linux.microsoft.com Signed-off-by: Wei Liu Message-ID: <20250507182227.7421-3-romank@linux.microsoft.com> --- arch/x86/coco/sev/core.c | 13 ++----------- arch/x86/hyperv/hv_vtl.c | 12 ++---------- arch/x86/hyperv/ivm.c | 15 ++------------- arch/x86/include/asm/apic.h | 8 ++++---- arch/x86/include/asm/mshyperv.h | 5 +++-- arch/x86/kernel/acpi/madt_wakeup.c | 2 +- arch/x86/kernel/apic/apic_noop.c | 8 +++++++- arch/x86/kernel/apic/apic_numachip.c | 2 +- arch/x86/kernel/apic/x2apic_uv_x.c | 2 +- arch/x86/kernel/smpboot.c | 10 +++++----- 10 files changed, 28 insertions(+), 49 deletions(-) diff --git a/arch/x86/coco/sev/core.c b/arch/x86/coco/sev/core.c index b0c1a7a57497..7780d55d1833 100644 --- a/arch/x86/coco/sev/core.c +++ b/arch/x86/coco/sev/core.c @@ -1177,7 +1177,7 @@ static void snp_cleanup_vmsa(struct sev_es_save_area *vmsa, int apic_id) free_page((unsigned long)vmsa); } -static int wakeup_cpu_via_vmgexit(u32 apic_id, unsigned long start_ip) +static int wakeup_cpu_via_vmgexit(u32 apic_id, unsigned long start_ip, unsigned int cpu) { struct sev_es_save_area *cur_vmsa, *vmsa; struct ghcb_state state; @@ -1185,7 +1185,7 @@ static int wakeup_cpu_via_vmgexit(u32 apic_id, unsigned long start_ip) unsigned long flags; struct ghcb *ghcb; u8 sipi_vector; - int cpu, ret; + int ret; u64 cr4; /* @@ -1206,15 +1206,6 @@ static int wakeup_cpu_via_vmgexit(u32 apic_id, unsigned long start_ip) /* Override start_ip with known protected guest start IP */ start_ip = real_mode_header->sev_es_trampoline_start; - - /* Find the logical CPU for the APIC ID */ - for_each_present_cpu(cpu) { - if (arch_match_cpu_phys_id(cpu, apic_id)) - break; - } - if (cpu >= nr_cpu_ids) - return -EINVAL; - cur_vmsa = per_cpu(sev_vmsa, cpu); /* diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c index 2f32ac1ae40e..3d149a2ca4c8 100644 --- a/arch/x86/hyperv/hv_vtl.c +++ b/arch/x86/hyperv/hv_vtl.c @@ -211,17 +211,9 @@ free_lock: return ret; } -static int hv_vtl_wakeup_secondary_cpu(u32 apicid, unsigned long start_eip) +static int hv_vtl_wakeup_secondary_cpu(u32 apicid, unsigned long start_eip, unsigned int cpu) { - int vp_index, cpu; - - /* Find the logical CPU for the APIC ID */ - for_each_present_cpu(cpu) { - if (arch_match_cpu_phys_id(cpu, apicid)) - break; - } - if (cpu >= nr_cpu_ids) - return -EINVAL; + int vp_index; pr_debug("Bringing up CPU with APIC ID %d in VTL2...\n", apicid); vp_index = hv_apicid_to_vp_index(apicid); diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c index 0cc239cdb4da..e21557b24d19 100644 --- a/arch/x86/hyperv/ivm.c +++ b/arch/x86/hyperv/ivm.c @@ -289,7 +289,7 @@ static void snp_cleanup_vmsa(struct sev_es_save_area *vmsa) free_page((unsigned long)vmsa); } -int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip) +int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip, unsigned int cpu) { struct sev_es_save_area *vmsa = (struct sev_es_save_area *) __get_free_page(GFP_KERNEL | __GFP_ZERO); @@ -298,7 +298,7 @@ int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip) u64 ret, retry = 5; struct hv_enable_vp_vtl *start_vp_input; unsigned long flags; - int cpu, vp_index; + int vp_index; if (!vmsa) return -ENOMEM; @@ -308,17 +308,6 @@ int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip) if (vp_index < 0 || vp_index > ms_hyperv.max_vp_index) return -EINVAL; - /* - * Find the Linux CPU number for addressing the per-CPU data, and it - * might not be the same as APIC ID. - */ - for_each_present_cpu(cpu) { - if (arch_match_cpu_phys_id(cpu, apic_id)) - break; - } - if (cpu >= nr_cpu_ids) - return -EINVAL; - native_store_gdt(&gdtr); vmsa->gdtr.base = gdtr.address; diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index c903d358405d..eaf43d446203 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -313,9 +313,9 @@ struct apic { u32 (*get_apic_id)(u32 id); /* wakeup_secondary_cpu */ - int (*wakeup_secondary_cpu)(u32 apicid, unsigned long start_eip); + int (*wakeup_secondary_cpu)(u32 apicid, unsigned long start_eip, unsigned int cpu); /* wakeup secondary CPU using 64-bit wakeup point */ - int (*wakeup_secondary_cpu_64)(u32 apicid, unsigned long start_eip); + int (*wakeup_secondary_cpu_64)(u32 apicid, unsigned long start_eip, unsigned int cpu); char *name; }; @@ -333,8 +333,8 @@ struct apic_override { void (*send_IPI_self)(int vector); u64 (*icr_read)(void); void (*icr_write)(u32 low, u32 high); - int (*wakeup_secondary_cpu)(u32 apicid, unsigned long start_eip); - int (*wakeup_secondary_cpu_64)(u32 apicid, unsigned long start_eip); + int (*wakeup_secondary_cpu)(u32 apicid, unsigned long start_eip, unsigned int cpu); + int (*wakeup_secondary_cpu_64)(u32 apicid, unsigned long start_eip, unsigned int cpu); }; /* diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h index 0b9a3a307d06..5ec92e3e2e37 100644 --- a/arch/x86/include/asm/mshyperv.h +++ b/arch/x86/include/asm/mshyperv.h @@ -268,11 +268,12 @@ int hv_unmap_ioapic_interrupt(int ioapic_id, struct hv_interrupt_entry *entry); #ifdef CONFIG_AMD_MEM_ENCRYPT bool hv_ghcb_negotiate_protocol(void); void __noreturn hv_ghcb_terminate(unsigned int set, unsigned int reason); -int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip); +int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip, unsigned int cpu); #else static inline bool hv_ghcb_negotiate_protocol(void) { return false; } static inline void hv_ghcb_terminate(unsigned int set, unsigned int reason) {} -static inline int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip) { return 0; } +static inline int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip, + unsigned int cpu) { return 0; } #endif #if defined(CONFIG_AMD_MEM_ENCRYPT) || defined(CONFIG_INTEL_TDX_GUEST) diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c index f36f28405dcc..6d7603511f52 100644 --- a/arch/x86/kernel/acpi/madt_wakeup.c +++ b/arch/x86/kernel/acpi/madt_wakeup.c @@ -126,7 +126,7 @@ static int __init acpi_mp_setup_reset(u64 reset_vector) return 0; } -static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip) +static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip, unsigned int cpu) { if (!acpi_mp_wake_mailbox_paddr) { pr_warn_once("No MADT mailbox: cannot bringup secondary CPUs. Booting with kexec?\n"); diff --git a/arch/x86/kernel/apic/apic_noop.c b/arch/x86/kernel/apic/apic_noop.c index b5bb7a2e8340..58abb941c45b 100644 --- a/arch/x86/kernel/apic/apic_noop.c +++ b/arch/x86/kernel/apic/apic_noop.c @@ -27,7 +27,13 @@ static void noop_send_IPI_allbutself(int vector) { } static void noop_send_IPI_all(int vector) { } static void noop_send_IPI_self(int vector) { } static void noop_apic_icr_write(u32 low, u32 id) { } -static int noop_wakeup_secondary_cpu(u32 apicid, unsigned long start_eip) { return -1; } + +static int noop_wakeup_secondary_cpu(u32 apicid, unsigned long start_eip, + unsigned int cpu) +{ + return -1; +} + static u64 noop_apic_icr_read(void) { return 0; } static u32 noop_get_apic_id(u32 apicid) { return 0; } static void noop_apic_eoi(void) { } diff --git a/arch/x86/kernel/apic/apic_numachip.c b/arch/x86/kernel/apic/apic_numachip.c index 16410f087b7a..333536b89bde 100644 --- a/arch/x86/kernel/apic/apic_numachip.c +++ b/arch/x86/kernel/apic/apic_numachip.c @@ -56,7 +56,7 @@ static void numachip2_apic_icr_write(int apicid, unsigned int val) numachip2_write32_lcsr(NUMACHIP2_APIC_ICR, (apicid << 12) | val); } -static int numachip_wakeup_secondary(u32 phys_apicid, unsigned long start_rip) +static int numachip_wakeup_secondary(u32 phys_apicid, unsigned long start_rip, unsigned int cpu) { numachip_apic_icr_write(phys_apicid, APIC_DM_INIT); numachip_apic_icr_write(phys_apicid, APIC_DM_STARTUP | diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c index 7fef504ca508..15209f220e1f 100644 --- a/arch/x86/kernel/apic/x2apic_uv_x.c +++ b/arch/x86/kernel/apic/x2apic_uv_x.c @@ -667,7 +667,7 @@ static __init void build_uv_gr_table(void) } } -static int uv_wakeup_secondary(u32 phys_apicid, unsigned long start_rip) +static int uv_wakeup_secondary(u32 phys_apicid, unsigned long start_rip, unsigned int cpu) { unsigned long val; int pnode; diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index d6cf1e23c2a3..d52e9238e9fd 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -695,7 +695,7 @@ static void send_init_sequence(u32 phys_apicid) /* * Wake up AP by INIT, INIT, STARTUP sequence. */ -static int wakeup_secondary_cpu_via_init(u32 phys_apicid, unsigned long start_eip) +static int wakeup_secondary_cpu_via_init(u32 phys_apicid, unsigned long start_eip, unsigned int cpu) { unsigned long send_status = 0, accept_status = 0; int num_starts, j, maxlvt; @@ -842,7 +842,7 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle) * Returns zero if startup was successfully sent, else error code from * ->wakeup_secondary_cpu. */ -static int do_boot_cpu(u32 apicid, int cpu, struct task_struct *idle) +static int do_boot_cpu(u32 apicid, unsigned int cpu, struct task_struct *idle) { unsigned long start_ip = real_mode_header->trampoline_start; int ret; @@ -896,11 +896,11 @@ static int do_boot_cpu(u32 apicid, int cpu, struct task_struct *idle) * - Use an INIT boot APIC message */ if (apic->wakeup_secondary_cpu_64) - ret = apic->wakeup_secondary_cpu_64(apicid, start_ip); + ret = apic->wakeup_secondary_cpu_64(apicid, start_ip, cpu); else if (apic->wakeup_secondary_cpu) - ret = apic->wakeup_secondary_cpu(apicid, start_ip); + ret = apic->wakeup_secondary_cpu(apicid, start_ip, cpu); else - ret = wakeup_secondary_cpu_via_init(apicid, start_ip); + ret = wakeup_secondary_cpu_via_init(apicid, start_ip, cpu); /* If the wakeup mechanism failed, cleanup the warm reset vector */ if (ret) -- 2.51.0 From 09eea7ad0b8e973dcf5ed49902838e5d68177f8e Mon Sep 17 00:00:00 2001 From: Long Li Date: Mon, 5 May 2025 17:56:33 -0700 Subject: [PATCH 10/16] Drivers: hv: Allocate interrupt and monitor pages aligned to system page boundary There are use cases that interrupt and monitor pages are mapped to user-mode through UIO, so they need to be system page aligned. Some Hyper-V allocation APIs introduced earlier broke those requirements. Fix this by using page allocation functions directly for interrupt and monitor pages. Cc: stable@vger.kernel.org Fixes: ca48739e59df ("Drivers: hv: vmbus: Move Hyper-V page allocator to arch neutral code") Signed-off-by: Long Li Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/1746492997-4599-2-git-send-email-longli@linuxonhyperv.com Signed-off-by: Wei Liu Message-ID: <1746492997-4599-2-git-send-email-longli@linuxonhyperv.com> --- drivers/hv/connection.c | 23 +++++++++++++++++------ 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c index 8351360bba16..be490c598785 100644 --- a/drivers/hv/connection.c +++ b/drivers/hv/connection.c @@ -206,11 +206,20 @@ int vmbus_connect(void) INIT_LIST_HEAD(&vmbus_connection.chn_list); mutex_init(&vmbus_connection.channel_mutex); + /* + * The following Hyper-V interrupt and monitor pages can be used by + * UIO for mapping to user-space, so they should always be allocated on + * system page boundaries. The system page size must be >= the Hyper-V + * page size. + */ + BUILD_BUG_ON(PAGE_SIZE < HV_HYP_PAGE_SIZE); + /* * Setup the vmbus event connection for channel interrupt * abstraction stuff */ - vmbus_connection.int_page = hv_alloc_hyperv_zeroed_page(); + vmbus_connection.int_page = + (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO); if (vmbus_connection.int_page == NULL) { ret = -ENOMEM; goto cleanup; @@ -225,8 +234,8 @@ int vmbus_connect(void) * Setup the monitor notification facility. The 1st page for * parent->child and the 2nd page for child->parent */ - vmbus_connection.monitor_pages[0] = hv_alloc_hyperv_page(); - vmbus_connection.monitor_pages[1] = hv_alloc_hyperv_page(); + vmbus_connection.monitor_pages[0] = (void *)__get_free_page(GFP_KERNEL); + vmbus_connection.monitor_pages[1] = (void *)__get_free_page(GFP_KERNEL); if ((vmbus_connection.monitor_pages[0] == NULL) || (vmbus_connection.monitor_pages[1] == NULL)) { ret = -ENOMEM; @@ -342,21 +351,23 @@ void vmbus_disconnect(void) destroy_workqueue(vmbus_connection.work_queue); if (vmbus_connection.int_page) { - hv_free_hyperv_page(vmbus_connection.int_page); + free_page((unsigned long)vmbus_connection.int_page); vmbus_connection.int_page = NULL; } if (vmbus_connection.monitor_pages[0]) { if (!set_memory_encrypted( (unsigned long)vmbus_connection.monitor_pages[0], 1)) - hv_free_hyperv_page(vmbus_connection.monitor_pages[0]); + free_page((unsigned long) + vmbus_connection.monitor_pages[0]); vmbus_connection.monitor_pages[0] = NULL; } if (vmbus_connection.monitor_pages[1]) { if (!set_memory_encrypted( (unsigned long)vmbus_connection.monitor_pages[1], 1)) - hv_free_hyperv_page(vmbus_connection.monitor_pages[1]); + free_page((unsigned long) + vmbus_connection.monitor_pages[1]); vmbus_connection.monitor_pages[1] = NULL; } } -- 2.51.0 From c951ab8fd3589cf6991ed4111d2130816f2e3ac2 Mon Sep 17 00:00:00 2001 From: Long Li Date: Mon, 5 May 2025 17:56:34 -0700 Subject: [PATCH 11/16] uio_hv_generic: Use correct size for interrupt and monitor pages Interrupt and monitor pages should be in Hyper-V page size (4k bytes). This can be different from the system page size. This size is read and used by the user-mode program to determine the mapped data region. An example of such user-mode program is the VMBus driver in DPDK. Cc: stable@vger.kernel.org Fixes: 95096f2fbd10 ("uio-hv-generic: new userspace i/o driver for VMBus") Signed-off-by: Long Li Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/1746492997-4599-3-git-send-email-longli@linuxonhyperv.com Signed-off-by: Wei Liu Message-ID: <1746492997-4599-3-git-send-email-longli@linuxonhyperv.com> --- drivers/uio/uio_hv_generic.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/uio/uio_hv_generic.c b/drivers/uio/uio_hv_generic.c index 69c1df0f4ca5..cc3a350dbbd5 100644 --- a/drivers/uio/uio_hv_generic.c +++ b/drivers/uio/uio_hv_generic.c @@ -274,13 +274,13 @@ hv_uio_probe(struct hv_device *dev, pdata->info.mem[INT_PAGE_MAP].name = "int_page"; pdata->info.mem[INT_PAGE_MAP].addr = (uintptr_t)vmbus_connection.int_page; - pdata->info.mem[INT_PAGE_MAP].size = PAGE_SIZE; + pdata->info.mem[INT_PAGE_MAP].size = HV_HYP_PAGE_SIZE; pdata->info.mem[INT_PAGE_MAP].memtype = UIO_MEM_LOGICAL; pdata->info.mem[MON_PAGE_MAP].name = "monitor_page"; pdata->info.mem[MON_PAGE_MAP].addr = (uintptr_t)vmbus_connection.monitor_pages[1]; - pdata->info.mem[MON_PAGE_MAP].size = PAGE_SIZE; + pdata->info.mem[MON_PAGE_MAP].size = HV_HYP_PAGE_SIZE; pdata->info.mem[MON_PAGE_MAP].memtype = UIO_MEM_LOGICAL; if (channel->device_id == HV_NIC) { -- 2.51.0 From 0315fef2aff9f251ddef8a4b53db9187429c3553 Mon Sep 17 00:00:00 2001 From: Long Li Date: Mon, 5 May 2025 17:56:35 -0700 Subject: [PATCH 12/16] uio_hv_generic: Align ring size to system page Following the ring header, the ring data should align to system page boundary. Adjust the size if necessary. Cc: stable@vger.kernel.org Fixes: 95096f2fbd10 ("uio-hv-generic: new userspace i/o driver for VMBus") Signed-off-by: Long Li Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/1746492997-4599-4-git-send-email-longli@linuxonhyperv.com Signed-off-by: Wei Liu Message-ID: <1746492997-4599-4-git-send-email-longli@linuxonhyperv.com> --- drivers/uio/uio_hv_generic.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/uio/uio_hv_generic.c b/drivers/uio/uio_hv_generic.c index cc3a350dbbd5..aac67a4413ce 100644 --- a/drivers/uio/uio_hv_generic.c +++ b/drivers/uio/uio_hv_generic.c @@ -243,6 +243,9 @@ hv_uio_probe(struct hv_device *dev, if (!ring_size) ring_size = SZ_2M; + /* Adjust ring size if necessary to have it page aligned */ + ring_size = VMBUS_RING_SIZE(ring_size); + pdata = devm_kzalloc(&dev->device, sizeof(*pdata), GFP_KERNEL); if (!pdata) return -ENOMEM; -- 2.51.0 From a60822bc11b12e6c43cf25a610174ac44f744b27 Mon Sep 17 00:00:00 2001 From: Long Li Date: Mon, 5 May 2025 17:56:36 -0700 Subject: [PATCH 13/16] Drivers: hv: Use kzalloc for panic page allocation To prepare for removal of hv_alloc_* and hv_free* functions, use kzalloc/kfree directly for panic reporting page. Signed-off-by: Long Li Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/1746492997-4599-5-git-send-email-longli@linuxonhyperv.com Signed-off-by: Wei Liu Message-ID: <1746492997-4599-5-git-send-email-longli@linuxonhyperv.com> --- drivers/hv/hv_common.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c index f8e4330bd4c7..95c2497ffcff 100644 --- a/drivers/hv/hv_common.c +++ b/drivers/hv/hv_common.c @@ -272,7 +272,7 @@ static void hv_kmsg_dump_unregister(void) atomic_notifier_chain_unregister(&panic_notifier_list, &hyperv_panic_report_block); - hv_free_hyperv_page(hv_panic_page); + kfree(hv_panic_page); hv_panic_page = NULL; } @@ -280,7 +280,7 @@ static void hv_kmsg_dump_register(void) { int ret; - hv_panic_page = hv_alloc_hyperv_zeroed_page(); + hv_panic_page = kzalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL); if (!hv_panic_page) { pr_err("Hyper-V: panic message page memory allocation failed\n"); return; @@ -289,7 +289,7 @@ static void hv_kmsg_dump_register(void) ret = kmsg_dump_register(&hv_kmsg_dumper); if (ret) { pr_err("Hyper-V: kmsg dump register error 0x%x\n", ret); - hv_free_hyperv_page(hv_panic_page); + kfree(hv_panic_page); hv_panic_page = NULL; } } -- 2.51.0 From cd1769e1fef9ab8fbdfafb67b2c327418867afb6 Mon Sep 17 00:00:00 2001 From: Long Li Date: Mon, 5 May 2025 17:56:37 -0700 Subject: [PATCH 14/16] Drivers: hv: Remove hv_alloc/free_* helpers There are no users for those functions, remove them. Signed-off-by: Long Li Reviewed-by: Michael Kelley Link: https://lore.kernel.org/r/1746492997-4599-6-git-send-email-longli@linuxonhyperv.com Signed-off-by: Wei Liu Message-ID: <1746492997-4599-6-git-send-email-longli@linuxonhyperv.com> --- drivers/hv/hv_common.c | 39 ---------------------------------- include/asm-generic/mshyperv.h | 4 ---- 2 files changed, 43 deletions(-) diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c index 95c2497ffcff..49898d10faff 100644 --- a/drivers/hv/hv_common.c +++ b/drivers/hv/hv_common.c @@ -105,45 +105,6 @@ void __init hv_common_free(void) hv_synic_eventring_tail = NULL; } -/* - * Functions for allocating and freeing memory with size and - * alignment HV_HYP_PAGE_SIZE. These functions are needed because - * the guest page size may not be the same as the Hyper-V page - * size. We depend upon kmalloc() aligning power-of-two size - * allocations to the allocation size boundary, so that the - * allocated memory appears to Hyper-V as a page of the size - * it expects. - */ - -void *hv_alloc_hyperv_page(void) -{ - BUILD_BUG_ON(PAGE_SIZE < HV_HYP_PAGE_SIZE); - - if (PAGE_SIZE == HV_HYP_PAGE_SIZE) - return (void *)__get_free_page(GFP_KERNEL); - else - return kmalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL); -} -EXPORT_SYMBOL_GPL(hv_alloc_hyperv_page); - -void *hv_alloc_hyperv_zeroed_page(void) -{ - if (PAGE_SIZE == HV_HYP_PAGE_SIZE) - return (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO); - else - return kzalloc(HV_HYP_PAGE_SIZE, GFP_KERNEL); -} -EXPORT_SYMBOL_GPL(hv_alloc_hyperv_zeroed_page); - -void hv_free_hyperv_page(void *addr) -{ - if (PAGE_SIZE == HV_HYP_PAGE_SIZE) - free_page((unsigned long)addr); - else - kfree(addr); -} -EXPORT_SYMBOL_GPL(hv_free_hyperv_page); - static void *hv_panic_page; /* diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h index 6c51a25ed7b5..a729b77983fa 100644 --- a/include/asm-generic/mshyperv.h +++ b/include/asm-generic/mshyperv.h @@ -236,10 +236,6 @@ int hv_common_cpu_init(unsigned int cpu); int hv_common_cpu_die(unsigned int cpu); void hv_identify_partition_type(void); -void *hv_alloc_hyperv_page(void); -void *hv_alloc_hyperv_zeroed_page(void); -void hv_free_hyperv_page(void *addr); - /** * hv_cpu_number_to_vp_number() - Map CPU to VP. * @cpu_number: CPU number in Linux terms -- 2.51.0 From dd1af0c4c56d7c60eaf7f30f9d816ed1befbd7d7 Mon Sep 17 00:00:00 2001 From: Michael Kelley Date: Tue, 13 May 2025 21:44:40 -0700 Subject: [PATCH 15/16] PCI: hv: Remove unnecessary flex array in struct pci_packet struct pci_packet contains a "message" field that is a flex array of struct pci_message. struct pci_packet is usually followed by a second struct in a containing struct that is defined locally in individual functions in pci-hyperv.c. As such, the compiler flag -Wflex-array-member-not-at-end (introduced in gcc-14) generates multiple warnings such as: drivers/pci/controller/pci-hyperv.c:3809:35: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] The Linux kernel intends to introduce this compiler flag in standard builds, so the current code is problematic in generating these warnings. The "message" field is used only to locate the start of the second struct, and not as an array. Because the second struct can be addressed directly, the "message" field is not really necessary. Rather than try to fix its usage to meet the requirements of -Wflex-array-member-not-at-end, just eliminate the field and either directly reference the second struct, or use "pkt + 1" when "pkt" is dynamically allocated. Reported-by: Gustavo A. R. Silva Signed-off-by: Michael Kelley Link: https://lore.kernel.org/r/20250514044440.48924-1-mhklinux@outlook.com Signed-off-by: Wei Liu Message-ID: <20250514044440.48924-1-mhklinux@outlook.com> --- drivers/pci/controller/pci-hyperv.c | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index 682f1ca2d684..b4f29ee75848 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -310,8 +310,6 @@ struct pci_packet { void (*completion_func)(void *context, struct pci_response *resp, int resp_packet_size); void *compl_ctxt; - - struct pci_message message[]; }; /* @@ -1496,7 +1494,7 @@ static int hv_read_config_block(struct pci_dev *pdev, void *buf, memset(&pkt, 0, sizeof(pkt)); pkt.pkt.completion_func = hv_pci_read_config_compl; pkt.pkt.compl_ctxt = &comp_pkt; - read_blk = (struct pci_read_block *)&pkt.pkt.message; + read_blk = (struct pci_read_block *)pkt.buf; read_blk->message_type.type = PCI_READ_BLOCK; read_blk->wslot.slot = devfn_to_wslot(pdev->devfn); read_blk->block_id = block_id; @@ -1576,7 +1574,7 @@ static int hv_write_config_block(struct pci_dev *pdev, void *buf, memset(&pkt, 0, sizeof(pkt)); pkt.pkt.completion_func = hv_pci_write_config_compl; pkt.pkt.compl_ctxt = &comp_pkt; - write_blk = (struct pci_write_block *)&pkt.pkt.message; + write_blk = (struct pci_write_block *)pkt.buf; write_blk->message_type.type = PCI_WRITE_BLOCK; write_blk->wslot.slot = devfn_to_wslot(pdev->devfn); write_blk->block_id = block_id; @@ -1657,7 +1655,7 @@ static void hv_int_desc_free(struct hv_pci_dev *hpdev, return; } memset(&ctxt, 0, sizeof(ctxt)); - int_pkt = (struct pci_delete_interrupt *)&ctxt.pkt.message; + int_pkt = (struct pci_delete_interrupt *)ctxt.buffer; int_pkt->message_type.type = PCI_DELETE_INTERRUPT_MESSAGE; int_pkt->wslot.slot = hpdev->desc.win_slot.slot; @@ -2540,7 +2538,7 @@ static struct hv_pci_dev *new_pcichild_device(struct hv_pcibus_device *hbus, comp_pkt.hpdev = hpdev; pkt.init_packet.compl_ctxt = &comp_pkt; pkt.init_packet.completion_func = q_resource_requirements; - res_req = (struct pci_child_message *)&pkt.init_packet.message; + res_req = (struct pci_child_message *)pkt.buffer; res_req->message_type.type = PCI_QUERY_RESOURCE_REQUIREMENTS; res_req->wslot.slot = desc->win_slot.slot; @@ -2918,7 +2916,7 @@ static void hv_eject_device_work(struct work_struct *work) pci_destroy_slot(hpdev->pci_slot); memset(&ctxt, 0, sizeof(ctxt)); - ejct_pkt = (struct pci_eject_response *)&ctxt.pkt.message; + ejct_pkt = (struct pci_eject_response *)ctxt.buffer; ejct_pkt->message_type.type = PCI_EJECTION_COMPLETE; ejct_pkt->wslot.slot = hpdev->desc.win_slot.slot; vmbus_sendpacket(hbus->hdev->channel, ejct_pkt, @@ -3176,7 +3174,7 @@ static int hv_pci_protocol_negotiation(struct hv_device *hdev, init_completion(&comp_pkt.host_event); pkt->completion_func = hv_pci_generic_compl; pkt->compl_ctxt = &comp_pkt; - version_req = (struct pci_version_request *)&pkt->message; + version_req = (struct pci_version_request *)(pkt + 1); version_req->message_type.type = PCI_QUERY_PROTOCOL_VERSION; for (i = 0; i < num_version; i++) { @@ -3398,7 +3396,7 @@ enter_d0_retry: init_completion(&comp_pkt.host_event); pkt->completion_func = hv_pci_generic_compl; pkt->compl_ctxt = &comp_pkt; - d0_entry = (struct pci_bus_d0_entry *)&pkt->message; + d0_entry = (struct pci_bus_d0_entry *)(pkt + 1); d0_entry->message_type.type = PCI_BUS_D0ENTRY; d0_entry->mmio_base = hbus->mem_config->start; @@ -3556,20 +3554,20 @@ static int hv_send_resources_allocated(struct hv_device *hdev) if (hbus->protocol_version < PCI_PROTOCOL_VERSION_1_2) { res_assigned = - (struct pci_resources_assigned *)&pkt->message; + (struct pci_resources_assigned *)(pkt + 1); res_assigned->message_type.type = PCI_RESOURCES_ASSIGNED; res_assigned->wslot.slot = hpdev->desc.win_slot.slot; } else { res_assigned2 = - (struct pci_resources_assigned2 *)&pkt->message; + (struct pci_resources_assigned2 *)(pkt + 1); res_assigned2->message_type.type = PCI_RESOURCES_ASSIGNED2; res_assigned2->wslot.slot = hpdev->desc.win_slot.slot; } put_pcichild(hpdev); - ret = vmbus_sendpacket(hdev->channel, &pkt->message, + ret = vmbus_sendpacket(hdev->channel, pkt + 1, size_res, (unsigned long)pkt, VM_PKT_DATA_INBAND, VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED); @@ -3867,6 +3865,7 @@ static int hv_pci_bus_exit(struct hv_device *hdev, bool keep_devs) struct pci_packet teardown_packet; u8 buffer[sizeof(struct pci_message)]; } pkt; + struct pci_message *msg; struct hv_pci_compl comp_pkt; struct hv_pci_dev *hpdev, *tmp; unsigned long flags; @@ -3912,10 +3911,10 @@ static int hv_pci_bus_exit(struct hv_device *hdev, bool keep_devs) init_completion(&comp_pkt.host_event); pkt.teardown_packet.completion_func = hv_pci_generic_compl; pkt.teardown_packet.compl_ctxt = &comp_pkt; - pkt.teardown_packet.message[0].type = PCI_BUS_D0EXIT; + msg = (struct pci_message *)pkt.buffer; + msg->type = PCI_BUS_D0EXIT; - ret = vmbus_sendpacket_getid(chan, &pkt.teardown_packet.message, - sizeof(struct pci_message), + ret = vmbus_sendpacket_getid(chan, msg, sizeof(*msg), (unsigned long)&pkt.teardown_packet, &trans_id, VM_PKT_DATA_INBAND, VMBUS_DATA_PACKET_FLAG_COMPLETION_REQUESTED); -- 2.51.0 From f77276d1c9597ade06f3facbceb51bd1abf1bd5c Mon Sep 17 00:00:00 2001 From: Michael Kelley Date: Mon, 19 May 2025 21:44:35 -0700 Subject: [PATCH 16/16] Documentation: hyperv: Update VMBus doc with new features and info Starting in the 6.15 kernel, VMBus interrupts are automatically assigned away from a CPU that is being taken offline. Add documentation describing this case. Also add details of Hyper-V behavior when the primary channel of a VMBus device is closed as the result of unbinding the device's driver. This behavior has not changed, but it was not previously documented. Signed-off-by: Michael Kelley Link: https://lore.kernel.org/r/20250520044435.7734-1-mhklinux@outlook.com Signed-off-by: Wei Liu Message-ID: <20250520044435.7734-1-mhklinux@outlook.com> --- Documentation/virt/hyperv/vmbus.rst | 28 ++++++++++++++++++++++++---- 1 file changed, 24 insertions(+), 4 deletions(-) diff --git a/Documentation/virt/hyperv/vmbus.rst b/Documentation/virt/hyperv/vmbus.rst index 1dcef6a7fda3..654bb4849972 100644 --- a/Documentation/virt/hyperv/vmbus.rst +++ b/Documentation/virt/hyperv/vmbus.rst @@ -250,10 +250,18 @@ interrupts are not Linux IRQs, there are no entries in /proc/interrupts or /proc/irq corresponding to individual VMBus channel interrupts. An online CPU in a Linux guest may not be taken offline if it has -VMBus channel interrupts assigned to it. Any such channel -interrupts must first be manually reassigned to another CPU as -described above. When no channel interrupts are assigned to the -CPU, it can be taken offline. +VMBus channel interrupts assigned to it. Starting in kernel v6.15, +any such interrupts are automatically reassigned to some other CPU +at the time of offlining. The "other" CPU is chosen by the +implementation and is not load balanced or otherwise intelligently +determined. If the CPU is onlined again, channel interrupts previously +assigned to it are not moved back. As a result, after multiple CPUs +have been offlined, and perhaps onlined again, the interrupt-to-CPU +mapping may be scrambled and non-optimal. In such a case, optimal +assignments must be re-established manually. For kernels v6.14 and +earlier, any conflicting channel interrupts must first be manually +reassigned to another CPU as described above. Then when no channel +interrupts are assigned to the CPU, it can be taken offline. The VMBus channel interrupt handling code is designed to work correctly even if an interrupt is received on a CPU other than the @@ -324,3 +332,15 @@ rescinded, neither Hyper-V nor Linux retains any state about its previous existence. Such a device might be re-added later, in which case it is treated as an entirely new device. See vmbus_onoffer_rescind(). + +For some devices, such as the KVP device, Hyper-V automatically +sends a rescind message when the primary channel is closed, +likely as a result of unbinding the device from its driver. +The rescind causes Linux to remove the device. But then Hyper-V +immediately reoffers the device to the guest, causing a new +instance of the device to be created in Linux. For other +devices, such as the synthetic SCSI and NIC devices, closing the +primary channel does *not* result in Hyper-V sending a rescind +message. The device continues to exist in Linux on the VMBus, +but with no driver bound to it. The same driver or a new driver +can subsequently be bound to the existing instance of the device. -- 2.51.0