This is the second attempt to fix bug 29292411. The first attempt ran
afoul of special behavior in cluster_vector_allocation_domain(),
explained below. This was discovered by QA immediately before QU7 was
to be released, so the original patch was reverted.

In a Xen VM with vNUMA enabled, irq affinity for a device on node 1
may become stuck on CPU 0. /proc/irq/nnn/smp_affinity_list may show
affinity for all the CPUs on node 1, but this is wrong: all interrupts
are delivered to the first CPU of node 0, which is usually CPU 0.

The problem is caused when __assign_irq_vector() is called by
arch_setup_hwirq() with a mask of all online CPUs, and then called later
with a mask including only the node 1 CPUs. The first call assigns affinity
to CPU 0, and the second tries to move affinity to the first online node 1
CPU. In the reported case this is always CPU 2. For some reason, the
CPU 0 affinity is never cleaned up, and all interrupts remain with CPU 0.
Since an incomplete move appears to be in progress, all attempts to
reassign affinity for the irq fail. Because of a quirk in how affinity is
displayed in /proc/irq/nnn/smp_affinity_list, changes may appear to work
temporarily.
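
Roughly, the failing sequence looks like this (a simplified sketch
paraphrased from the UEK4 code paths, not the exact source):

	/* Device bringup: the initial vector is assigned with a mask
	 * of all online CPUs, so it lands on CPU 0 even for a device
	 * on node 1.
	 */
	__assign_irq_vector(irq, cfg, apic->target_cpus());

	/* Later, affinity is restricted to the node 1 CPUs. This
	 * starts a move from CPU 0 to the first online node 1 CPU
	 * (CPU 2 in the reported case), but the CPU 0 vector is never
	 * cleaned up, so the move never completes and subsequent
	 * affinity changes fail.
	 */
	__assign_irq_vector(irq, cfg, cpumask_of_node(1));
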
The problem was not reproducible on bare metal on the machine I had
available for testing, but it may have been observed on other
machines. It does not appear in UEK5: the APIC and IRQ code there is
very different, and the code changed here does not exist in UEK5.
Upstream has completely abandoned the UEK4 approach to IRQ management.
It is unknown whether KVM guests might see the same problem with UEK4.

Making arch_setup_hwirq() NUMA-sensitive eliminates the problem by
using the correct cpumask for the device's node for the initial
assignment. The second assignment then becomes a no-op. After
initialization is complete, affinity can be moved to any CPU on any
node and back without a problem.
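
A minimal sketch of the idea, assuming the initial assignment in
arch_setup_hwirq() is keyed off the device's node (illustrative only;
the actual change is in the diff that follows):

	const struct cpumask *mask;

	/* Prefer the CPUs local to the device's NUMA node when it is
	 * known and has online CPUs.
	 */
	if (node != NUMA_NO_NODE &&
	    cpumask_intersects(cpumask_of_node(node), cpu_online_mask))
		mask = cpumask_of_node(node);
	else
		mask = apic->target_cpus();	/* old behavior */

	__assign_irq_vector(irq, cfg, mask);
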
However, cluster_vector_allocation_domain() contains a hack designed
to reduce vector pressure in cluster x2apic mode. Specifically, it
relies on the address of the cpumask passed to it to determine whether
the allocation is for the default device bringup case or for an
explicit migration. If the address of the cpumask does not match what
is returned by apic->target_cpus(), it assumes the explicit migration
case and goes into cluster mode, which uses up vectors on multiple
CPUs. Since the original patch modifies arch_setup_hwirq() to pass a
cpumask containing only local CPUs, cluster_vector_allocation_domain()
allocates for the entire cluster rather than a single CPU. This can
cause vector allocation failures when there is a very large number of
devices, as can be the case with a large number of VFs (see bug
29534769).
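
For reference, the heuristic looks roughly like this (paraphrased from
the UEK4-era x2apic cluster code; names and details may differ
slightly):

	static void
	cluster_vector_allocation_domain(int cpu, struct cpumask *retmask,
					 const struct cpumask *mask)
	{
		/* The bringup-vs-migration decision is made purely by
		 * comparing the address of the incoming mask with the
		 * mask returned by apic->target_cpus().
		 */
		if (mask == apic->target_cpus())
			/* Default bringup: one CPU, to save vectors. */
			cpumask_copy(retmask, cpumask_of(cpu));
		else
			/* Explicit migration: the whole x2apic cluster. */
			cpumask_and(retmask, mask,
				    per_cpu(cpus_in_cluster, cpu));
	}
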
Orabug: 29534769

Signed-off-by: Henry Willard <henry.willard@oracle.com>
Reviewed-by: Shan Hai <shan.hai@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>