From: Henry Willard
Date: Thu, 14 Mar 2019 18:01:01 +0000 (-0700)
Subject: x86/apic: Make arch_setup_hwirq NUMA node aware
X-Git-Tag: v4.1.12-124.31.3~201
X-Git-Url: https://www.infradead.org/git/?a=commitdiff_plain;h=ccce8667f4e98dce515b2468886a54c34fbeff3d;p=users%2Fjedix%2Flinux-maple.git

x86/apic: Make arch_setup_hwirq NUMA node aware

This is a second attempt to fix bug 29292411. The first attempt ran afoul
of special behavior in cluster_vector_allocation_domain(), explained
below. This was discovered by QA immediately before QU7 was to be
released, so the original patch was reverted.

In a Xen VM with vNUMA enabled, IRQ affinity for a device on node 1 may
become stuck on CPU 0. /proc/irq/nnn/smp_affinity_list may show affinity
for all the CPUs on node 1, but this is wrong: all interrupts are on the
first CPU of node 0, which is usually CPU 0. The problem occurs when
__assign_irq_vector() is called by arch_setup_hwirq() with a mask of all
online CPUs and then called again later with a mask containing only the
node 1 CPUs. The first call assigns affinity to CPU 0, and the second
tries to move affinity to the first online node 1 CPU, which in the
reported case is always CPU 2. For some reason, the CPU 0 affinity is
never cleaned up, and all interrupts remain on CPU 0. Because an
incomplete move appears to be in progress, every subsequent attempt to
reassign affinity for the IRQ fails. Because of a quirk in how affinity is
displayed in /proc/irq/nnn/smp_affinity_list, changes may appear to work
temporarily.

The problem was not reproducible on bare metal on the machine I had
available for testing, but it is possible that it was observed on other
machines. It does not appear in UEK5: the APIC and IRQ code is very
different there, the code changed here does not exist in UEK5, and
upstream has completely abandoned the UEK4 approach to IRQ management. It
is unknown whether KVM guests might see the same problem with UEK4.

Making arch_setup_hwirq() NUMA aware eliminates the problem by using the
correct cpumask for the node for the initial assignment; the second
assignment then becomes a no-op. After initialization is complete,
affinity can be moved to any CPU on any node and back without a problem.

However, cluster_vector_allocation_domain() contains a hack designed to
reduce vector pressure in cluster x2apic. Specifically, it relies on the
address of the cpumask passed to it to decide whether the allocation is
for the default device-bringup case or for an explicit migration. If the
address of the cpumask does not match what apic->target_cpus() returns,
it assumes the explicit-migration case and goes into cluster mode, which
uses up vectors on multiple CPUs. Since the original patch modifies
arch_setup_hwirq() to pass a cpumask containing only local CPUs,
cluster_vector_allocation_domain() allocates for the entire cluster
rather than a single CPU. This can cause vector allocation failures when
there are a very large number of devices, such as when there are a large
number of VFs (see bug 29534769).

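For reference, the check described above looks roughly like the sketch
below. This is a paraphrase of cluster_vector_allocation_domain() based
on the description in this message, not a verbatim copy of the UEK4
source; the exact signature and the name of the per-CPU cluster mask are
approximations. The pointer comparison is what distinguishes default
bringup from an explicit migration:

static void cluster_vector_allocation_domain(int cpu, struct cpumask *retmask,
					     const struct cpumask *mask)
{
	/*
	 * Default device bringup passes apic->target_cpus() itself, so the
	 * pointers match and only a single CPU's vector is consumed.
	 */
	if (mask == apic->target_cpus())
		cpumask_copy(retmask, cpumask_of(cpu));
	else
		/*
		 * Any other mask (e.g. a node-local copy) is treated as an
		 * explicit migration and reserves a vector on every CPU in
		 * the x2apic cluster.
		 */
		cpumask_and(retmask, mask, per_cpu(cpus_in_cluster, cpu));
}

This is presumably why the fix below narrows the mask only when x2apic is
not enabled: under cluster x2apic, passing anything other than
apic->target_cpus() at bringup would push every device into the
cluster-wide path and recreate the vector exhaustion seen in bug 29534769.
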
Orabug: 29534769
Signed-off-by: Henry Willard
Reviewed-by: Shan Hai
Signed-off-by: Brian Maly
---

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 4902161b69e3..f784427675c4 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -509,7 +509,17 @@ int arch_setup_hwirq(unsigned int irq, int node)
 		return -ENOMEM;
 
 	raw_spin_lock_irqsave(&vector_lock, flags);
-	ret = __assign_irq_vector(irq, cfg, apic->target_cpus());
+	if (node != NUMA_NO_NODE && !x2apic_enabled()) {
+		const struct cpumask *node_mask = cpumask_of_node(node);
+		struct cpumask apic_mask;
+
+		cpumask_copy(&apic_mask, apic->target_cpus());
+		if (cpumask_intersects(&apic_mask, node_mask))
+			cpumask_and(&apic_mask, &apic_mask, node_mask);
+		ret = __assign_irq_vector(irq, cfg, &apic_mask);
+	} else {
+		ret = __assign_irq_vector(irq, cfg, apic->target_cpus());
+	}
 	raw_spin_unlock_irqrestore(&vector_lock, flags);
 
 	if (!ret)