Michael Ellerman [Wed, 21 Apr 2021 12:54:01 +0000 (22:54 +1000)]
powerpc/fadump: Fix sparse warnings
Sparse says:
arch/powerpc/kernel/fadump.c:48:16: warning: symbol 'fadump_kobj' was not declared. Should it be static?
arch/powerpc/kernel/fadump.c:55:27: warning: symbol 'crash_mrange_info' was not declared. Should it be static?
arch/powerpc/kernel/fadump.c:61:27: warning: symbol 'reserved_mrange_info' was not declared. Should it be static?
arch/powerpc/kernel/fadump.c:83:12: warning: symbol 'fadump_cma_init' was not declared. Should it be static?
And indeed none of them are used outside this file, they can all be made
static. Also fadump_kobj needs to be moved inside the ifdef where it's
used.
When probe_kernel_read_inst() was created, there was no good place to
put it, so a file called lib/inst.c was dedicated for it.
Since then, probe_kernel_read_inst() has been renamed
copy_inst_from_kernel_nofault(). And mm/maccess.h didn't exist at that
time. Today, mm/maccess.h is related to copy_from_kernel_nofault().
Move copy_inst_from_kernel_nofault() into mm/maccess.c
powerpc: Make probe_kernel_read_inst() common to PPC32 and PPC64
We have two independant versions of probe_kernel_read_inst(), one for
PPC32 and one for PPC64.
The PPC32 is identical to the first part of the PPC64 version.
The remaining part of PPC64 version is not relevant for PPC32, but
not contradictory, so we can easily have a common function with
the PPC64 part opted out via a IS_ENABLED(CONFIG_PPC64).
The only need is to add a version of ppc_inst_prefix() for PPC32.
Its name comes from former probe_user_read() function.
That function is now called copy_from_user_nofault().
probe_user_read_inst() uses copy_from_user_nofault() to read only
a few bytes. It is suboptimal.
It does the same as get_user_inst() but in addition disables
page faults.
But on the other hand, it is not used for the time being. So remove it
for now. If one day it is really needed, we can give it a new name
more in line with today's naming, and implement it using get_user_inst()
powerpc/ebpf32: Use standard function call for functions within 32M distance
If the target of a function call is within 32 Mbytes distance, use a
standard function call with 'bl' instead of the 'lis/ori/mtlr/blrl'
sequence.
In the first pass, no memory has been allocated yet and the code
position is not known yet (image pointer is NULL). This pass is there
to calculate the amount of memory to allocate for the EBPF code, so
assume the 4 instructions sequence is required, so that enough memory
is allocated.
Christophe Leroy [Fri, 22 Jan 2021 07:15:03 +0000 (07:15 +0000)]
powerpc/32: Use r2 in wrtspr() instead of r0
wrtspr() is a function to write an arbitrary value in a special
register. It is used on 8xx to write to SPRN_NRI, SPRN_EID and
SPRN_EIE. Writing any value to one of those will play with MSR EE
and MSR RI regardless of that value.
r0 is used many places in the generated code and using r0 for
that creates an unnecessary dependency of this instruction with
preceding ones using r0 in a few places in vmlinux.
r2 is most likely the most stable register as it contains the
pointer to 'current'.
Using r2 instead of r0 avoids that unnecessary dependency.
powerpc/mce: save ignore_event flag unconditionally for UE
When we hit an UE while using machine check safe copy routines,
ignore_event flag is set and the event is ignored by mce handler,
And the flag is also saved for defered handling and printing of
mce event information, But as of now saving of this flag is done
on checking if the effective address is provided or physical address
is calculated, which is not right.
Save ignore_event flag regardless of whether the effective address is
provided or physical address is calculated.
Without this change following log is seen, when the event is to be
ignored.
For that, create a 32 bits version of patch_imm64_load_insns()
and create a patch_imm_load_insns() which calls
patch_imm32_load_insns() on PPC32 and patch_imm64_load_insns()
on PPC64.
Adapt optprobes_head.S for PPC32. Use PPC_LL/PPC_STL macros instead
of raw ld/std, opt out things linked to paca and use stmw/lmw to
save/restore registers.
Leonardo Bras [Tue, 20 Apr 2021 04:54:04 +0000 (01:54 -0300)]
powerpc/pseries/iommu: Fix window size for direct mapping with pmem
As of today, if the DDW is big enough to fit (1 << MAX_PHYSMEM_BITS)
it's possible to use direct DMA mapping even with pmem region.
But, if that happens, the window size (len) is set to (MAX_PHYSMEM_BITS
- page_shift) instead of MAX_PHYSMEM_BITS, causing a pagesize times
smaller DDW to be created, being insufficient for correct usage.
Fix this so the correct window size is used in this case.
Fixes: bf6e2d562bbc4 ("powerpc/dma: Fallback to dma_ops when persistent memory present") Signed-off-by: Leonardo Bras <leobras.c@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210420045404.438735-1-leobras.c@gmail.com
Michael Ellerman [Mon, 19 Apr 2021 12:01:39 +0000 (22:01 +1000)]
powerpc/kvm: Fix PR KVM with KUAP/MEM_KEYS enabled
The changes to add KUAP support with the hash MMU broke booting of KVM
PR guests. The symptom is no visible progress of the guest, or possibly
just "SLOF" being printed to the qemu console.
Host code is still executing, but breaking into xmon might show a stack
trace such as:
Bisect points to commit b2ff33a10c8b ("powerpc/book3s64/hash/kuap:
Enable kuap on hash"), but that's just the commit that enabled KUAP with
hash and made the bug visible.
The root cause seems to be that KVM PR is creating kernel mappings that
don't use the correct key, since we switched to using key 3.
We have a helper for adding the right key value, however it's designed
to take a pteflags variable, which the KVM code doesn't have. But we can
make it work by passing 0 for the pteflags, and tell it explicitly that
it should use the kernel key.
Which happens because by the time we get to rtas_stop_self() we are
already offline. In addition the message can be spammy, and is not that
helpful for users, so remove it.
Michael Ellerman [Sun, 18 Apr 2021 13:16:41 +0000 (23:16 +1000)]
powerpc: Only define _TASK_CPU for 32-bit
We have some interesting code in our Makefile to define _TASK_CPU, based
on awk'ing the value out of asm-offsets.h. It exists to circumvent some
circular header dependencies that prevent us from referring to
task_struct in the relevant code. See the comment around _TASK_CPU in
smp.h for more detail.
Maybe one day we can come up with a better solution, but for now we can
at least limit that logic to 32-bit, because it's not needed for 64-bit.
powerpc/pseries: Add shutdown() to vio_driver and vio_bus
Currently, neither the vio_bus or vio_driver structures provide support
for a shutdown() routine.
Add support for shutdown() by allowing drivers to provide a
implementation via function pointer in their vio_driver struct and
provide a proper implementation in the driver template for the vio_bus
that calls a vio drivers shutdown() if defined.
In the case that no shutdown() is defined by a vio driver and a kexec is
in progress we implement a big hammer that calls remove() to ensure no
further DMA for the devices is possible.
Athira Rajeev [Mon, 22 Mar 2021 14:57:23 +0000 (10:57 -0400)]
powerpc/perf: Expose processor pipeline stage cycles using PERF_SAMPLE_WEIGHT_STRUCT
Performance Monitoring Unit (PMU) registers in powerpc provides
information on cycles elapsed between different stages in the
pipeline. This can be used for application tuning. On ISA v3.1
platform, this information is exposed by sampling registers.
Patch adds kernel support to capture two of the cycle counters
as part of perf sample using the sample type:
PERF_SAMPLE_WEIGHT_STRUCT.
The power PMU function 'get_mem_weight' currently uses 64 bit weight
field of perf_sample_data to capture memory latency. But following the
introduction of PERF_SAMPLE_WEIGHT_TYPE, weight field could contain
64-bit or 32-bit value depending on the architexture support for
PERF_SAMPLE_WEIGHT_STRUCT. Patches uses WEIGHT_STRUCT to expose the
pipeline stage cycles info. Hence update the ppmu functions to work for
64-bit and 32-bit weight values.
If the sample type is PERF_SAMPLE_WEIGHT, use the 64-bit weight field.
if the sample type is PERF_SAMPLE_WEIGHT_STRUCT, memory subsystem
latency is stored in the low 32bits of perf_sample_weight structure.
Also for CPU_FTR_ARCH_31, capture the two cycle counter information in
two 16 bit fields of perf_sample_weight structure.
powerpc/pseries: Set UNISOLATE on dlpar_cpu_remove() failure
The RTAS set-indicator call, when attempting to UNISOLATE a DRC that is
already UNISOLATED or CONFIGURED, returns RTAS_OK and does nothing else
for both QEMU and phyp. This gives us an opportunity to use this
behavior to signal the hypervisor layer when an error during device
removal happens, allowing it to do a proper error handling, while not
breaking QEMU/phyp implementations that don't have this support.
This patch introduces this idea by unisolating all CPU DRCs that failed
to be removed by dlpar_cpu_remove_by_index(), when handling the
PSERIES_HP_ELOG_ID_DRC_INDEX event. This is being done for this event
only because its the only CPU removal event QEMU uses, and there's no
need at this moment to add this mechanism for phyp only code.
Michael Ellerman [Mon, 19 Apr 2021 12:24:32 +0000 (22:24 +1000)]
powerpc/fadump: Fix compile error since trap type change
sfr reports that the allyesconfig build fails with:
arch/powerpc/kernel/fadump.c: In function 'crash_fadump':
arch/powerpc/kernel/fadump.c:731:28: error: 'INTERRUPT_SYSTEM_RESET' undeclared
731 | if (TRAP(&(fdh->regs)) == INTERRUPT_SYSTEM_RESET) {
Add an include of interrupt.h to fix it.
Fixes: 7153d4bf0b37 ("powerpc/traps: Enhance readability for trap types") Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
[mpe: Reformat change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210419191425.281dc58a@canb.auug.org.au
Nicholas Piggin [Fri, 2 Apr 2021 02:41:24 +0000 (12:41 +1000)]
powerpc/powernv: Enable HAIL (HV AIL) for ISA v3.1 processors
Starting with ISA v3.1, LPCR[AIL] no longer controls the interrupt
mode for HV=1 interrupts. Instead, a new LPCR[HAIL] bit is defined
which behaves like AIL=3 for HV interrupts when set.
Set HAIL on bare metal to give us mmu-on interrupts and improve
performance.
This also fixes an scv bug: we don't implement scv real mode (AIL=0)
vectors because they are at an inconvenient location, so we just
disable scv support when AIL can not be set. However powernv assumes
that LPCR[AIL] will enable AIL mode so it enables scv support despite
HV interrupts being AIL=0, which causes scv interrupts to go off into
the weeds.
Fixes: 7fa95f9adaee ("powerpc/64s: system call support for scv/rfscv instructions") Cc: stable@vger.kernel.org # v5.9+ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210402024124.545826-1-npiggin@gmail.com
Some of the per-CPU masks use cpu_cpu_mask as a filter to limit the search
for related CPUs. On a dlpar add of a CPU, update cpu_cpu_mask before
updating the per-CPU masks. This will ensure the cpu_cpu_mask is updated
correctly before its used in setting the masks. Setting the numa_node will
ensure that when cpu_cpu_mask() gets called, the correct node number is
used. This code movement helped fix the above call trace.
Xiongwei Song [Wed, 14 Apr 2021 11:00:33 +0000 (19:00 +0800)]
powerpc/traps: Enhance readability for trap types
Define macros to list ppc interrupt types in interttupt.h, replace the
reference of the trap hex values with these macros.
Referred the hex numbers in arch/powerpc/kernel/exceptions-64e.S,
arch/powerpc/kernel/exceptions-64s.S, arch/powerpc/kernel/head_*.S,
arch/powerpc/kernel/head_booke.h and arch/powerpc/include/asm/kvm_asm.h.
Tony Ambardar [Thu, 17 Sep 2020 13:54:37 +0000 (06:54 -0700)]
powerpc: fix EDEADLOCK redefinition error in uapi/asm/errno.h
A few archs like powerpc have different errno.h values for macros
EDEADLOCK and EDEADLK. In code including both libc and linux versions of
errno.h, this can result in multiple definitions of EDEADLOCK in the
include chain. Definitions to the same value (e.g. seen with mips) do
not raise warnings, but on powerpc there are redefinitions changing the
value, which raise warnings and errors (if using "-Werror").
Guard against these redefinitions to avoid build errors like the following,
first seen cross-compiling libbpf v5.8.9 for powerpc using GCC 8.4.0 with
musl 1.1.24:
In file included from ../../arch/powerpc/include/uapi/asm/errno.h:5,
from ../../include/linux/err.h:8,
from libbpf.c:29:
../../include/uapi/asm-generic/errno.h:40: error: "EDEADLOCK" redefined [-Werror]
#define EDEADLOCK EDEADLK
In file included from toolchain-powerpc_8540_gcc-8.4.0_musl/include/errno.h:10,
from libbpf.c:26:
toolchain-powerpc_8540_gcc-8.4.0_musl/include/bits/errno.h:58: note: this is the location of the previous definition
#define EDEADLOCK 58
On systems with large CPUs per node, even with the filtered matching of
related CPUs, there can be large number of calls to cpu_to_chip_id for
the same CPU. For example with 4096 vCPU, 1 node QEMU configuration,
with 4 threads per core, system could be see upto 1024 calls to
cpu_to_chip_id() for the same CPU. On a given system, cpu_to_chip_id()
for a given CPU would always return the same. Hence cache the result in
a lookup table for use in subsequent calls.
Since all CPUs sharing the same core will belong to the same chip, the
lookup_table has an entry for one CPU per core. chip_id_lookup_table is
not being freed and would be used on subsequent CPU online post CPU
offline.
Reported-by: Daniel Henrique Barboza <danielhb413@gmail.com> Suggested-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210415120934.232271-4-srikar@linux.vnet.ibm.com
Daniel reported that with Commit 4ca234a9cbd7 ("powerpc/smp: Stop
updating cpu_core_mask") QEMU was unable to set single NUMA node SMP
topologies such as:
-smp 8,maxcpus=8,cores=2,threads=2,sockets=2
i.e he expected 2 sockets in one NUMA node.
The above commit helped to reduce boot time on Large Systems for
example 4096 vCPU single socket QEMU instance. PAPR is silent on
having more than one socket within a NUMA node.
cpu_core_mask and cpu_cpu_mask for any CPU would be same unless the
number of sockets is different from the number of NUMA nodes.
One option is to reintroduce cpu_core_mask but use a slightly
different method to arrive at the cpu_core_mask. Previously each CPU's
chip-id would be compared with all other CPU's chip-id to verify if
both the CPUs were related at the chip level. Now if a CPU 'A' is
found related / (unrelated) to another CPU 'B', all the thread
siblings of 'A' and thread siblings of 'B' are automatically marked as
related / (unrelated).
Also if a platform doesn't support ibm,chip-id property, i.e its
cpu_to_chip_id returns -1, cpu_core_map holds a copy of
cpu_cpu_mask().
Fixes: 4ca234a9cbd7 ("powerpc/smp: Stop updating cpu_core_mask") Reported-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210415120934.232271-2-srikar@linux.vnet.ibm.com
powerpc/xive: Use the "ibm, chip-id" property only under PowerNV
The 'chip_id' field of the XIVE CPU structure is used to choose a
target for a source located on the same chip. For that, the XIVE
driver queries the chip identifier from the "ibm,chip-id" property
and compares it to a 'src_chip' field identifying the chip of a
source. This information is only available on the PowerNV platform,
'src_chip' being assigned to XIVE_INVALID_CHIP_ID under pSeries.
The "ibm,chip-id" property is also not available on all platforms. It
was first introduced on PowerNV and later, under QEMU for pSeries/KVM.
However, the property is not part of PAPR and does not exist under
pSeries/PowerVM.
Assign 'chip_id' to XIVE_INVALID_CHIP_ID by default and let the
PowerNV platform override the value with the "ibm,chip-id" property.
Tyrel Datwyler [Thu, 11 Feb 2021 18:24:35 +0000 (12:24 -0600)]
powerpc/pseries: extract host bridge from pci_bus prior to bus removal
The pci_bus->bridge reference may no longer be valid after
pci_bus_remove() resulting in passing a bad value to device_unregister()
for the associated bridge device.
Store the host_bridge reference in a separate variable prior to
pci_bus_remove().
Michael Ellerman [Fri, 16 Apr 2021 11:07:06 +0000 (21:07 +1000)]
powerpc/papr_scm: Fix build error due to wrong printf specifier
When I changed the rc variable to be long rather than int64_t I
neglected to update the printk(), leading to a build break:
arch/powerpc/platforms/pseries/papr_scm.c: In function 'papr_scm_pmem_flush':
arch/powerpc/platforms/pseries/papr_scm.c:144:26: warning: format
'%lld' expects argument of type 'long long int', but argument 3 has
type 'long int' [-Wformat=]
Christophe Leroy [Wed, 31 Mar 2021 16:48:47 +0000 (16:48 +0000)]
powerpc/vdso: Add support for time namespaces
This patch adds the necessary glue to provide time namespaces.
Things are mainly copied from ARM64.
__arch_get_timens_vdso_data() calculates timens vdso data position
based on the vdso data position, knowing it is the next page in vvar.
This avoids having to redo the mflr/bcl/mflr/mtlr dance to locate
the page relative to running code position.
Dmitry Safonov [Wed, 31 Mar 2021 16:48:46 +0000 (16:48 +0000)]
powerpc/vdso: Separate vvar vma from vdso
Since commit 511157ab641e ("powerpc/vdso: Move vdso datapage up front")
VVAR page is in front of the VDSO area. In result it breaks CRIU
(Checkpoint Restore In Userspace) [1], where CRIU expects that "[vdso]"
from /proc/../maps points at ELF/vdso image, rather than at VVAR data page.
Laurent made a patch to keep CRIU working (by reading aux vector).
But I think it still makes sence to separate two mappings into different
VMAs. It will also make ppc64 less "special" for userspace and as
a side-bonus will make VVAR page un-writable by debugger (which previously
would COW page and can be unexpected).
I opportunistically Cc stable on it: I understand that usually such
stuff isn't a stable material, but that will allow us in CRIU have
one workaround less that is needed just for one release (v5.11) on
one platform (ppc64), which we otherwise have to maintain.
I wouldn't go as far as to say that the commit 511157ab641e is ABI
regression as no other userspace got broken, but I'd really appreciate
if it gets backported to v5.11 after v5.12 is released, so as not
to complicate already non-simple CRIU-vdso code. Thanks!
Christophe Leroy [Wed, 31 Mar 2021 16:48:45 +0000 (16:48 +0000)]
lib/vdso: Add vdso_data pointer as input to __arch_get_timens_vdso_data()
For the same reason as commit e876f0b69dc9 ("lib/vdso: Allow
architectures to provide the vdso data pointer"), powerpc wants to
avoid calculation of relative position to code.
As the timens_vdso_data is next page to vdso_data, provide
vdso_data pointer to __arch_get_timens_vdso_data() in order
to ease the calculation on powerpc in following patches.
Christophe Leroy [Wed, 31 Mar 2021 16:48:44 +0000 (16:48 +0000)]
lib/vdso: Mark do_hres_timens() and do_coarse_timens() __always_inline()
In the same spirit as commit c966533f8c6c ("lib/vdso: Mark do_hres()
and do_coarse() as __always_inline"), mark do_hres_timens() and
do_coarse_timens() __always_inline.
The measurement below in on a non timens process, ie on the fastest path.
Nicholas Piggin [Tue, 16 Mar 2021 10:42:05 +0000 (20:42 +1000)]
powerpc: move norestart trap flag to bit 0
Compact the trap flags down to use the low 4 bits of regs.trap.
A few 64e interrupt trap numbers set bit 4. Although they tended to be
trivial so it wasn't a real problem[1], it is not the right thing to do,
and confusing.
[*] E.g., 0x310 hypercall goes to unknown_exception, which prints
regs->trap directly so 0x310 will appear fine, and only the syscall
interrupt will test norestart, so it won't be confused by 0x310.
Nicholas Piggin [Tue, 16 Mar 2021 10:42:03 +0000 (20:42 +1000)]
powerpc: clean up do_page_fault
search_exception_tables + __bad_page_fault can be substituted with
bad_page_fault, do_page_fault no longer needs to return a value
to asm for any sub-architecture, and __bad_page_fault can be static.
Nicholas Piggin [Tue, 16 Mar 2021 10:42:01 +0000 (20:42 +1000)]
powerpc/64e/interrupt: Use new interrupt context tracking scheme
With the new interrupt exit code, context tracking can be managed
more precisely, so remove the last of the 64e workarounds and switch
to the new context tracking code already used by 64s.
Nicholas Piggin [Tue, 16 Mar 2021 10:41:55 +0000 (20:41 +1000)]
powerpc/syscall: switch user_exit_irqoff and trace_hardirqs_off order
user_exit_irqoff() -> __context_tracking_exit -> vtime_user_exit
warns in __seqprop_assert due to lockdep thinking preemption is enabled
because trace_hardirqs_off() has not yet been called.
Switch the order of these two calls, which matches their ordering in
interrupt_enter_prepare.
powerpc/perf: Infrastructure to support checking of attr.config*
Introduce code to support the checking of attr.config* for
values which are reserved for a given platform.
Performance Monitoring Unit (PMU) configuration registers
have fields that are reserved and some specific values for
bit fields are reserved. For ex., MMCRA[61:62] is
Random Sampling Mode (SM) and value of 0b11 for this field
is reserved.
Writing non-zero or invalid values in these fields will
have unknown behaviours.
Patch adds a generic call-back function "check_attr_config"
in "struct power_pmu", to be called in event_init to
check for attr.config* values for a given platform.
powerpc/mem: flush_dcache_icache_phys() is for HIGHMEM pages only
__flush_dcache_icache() is usable for non HIGHMEM pages on
every platform.
It is only for HIGHMEM pages that BOOKE needs kmap() and
BOOK3S needs flush_dcache_icache_phys().
So make flush_dcache_icache_phys() dependent on CONFIG_HIGHMEM and
call it only when it is a HIGHMEM page.
We could make flush_dcache_icache_phys() available at all time,
but as it is declared NOKPROBE_SYMBOL(), GCC doesn't optimise
it out when it is not used.
So define a stub for !CONFIG_HIGHMEM in order to remove the #ifdef in
flush_dcache_icache_page() and use IS_ENABLED() instead.
flush_dcache_icache_hugepage() is a static function, with
only one caller. That caller calls it when PageCompound() is true,
so bugging on !PageCompound() is useless if we can trust the
compiler a little. Remove the BUG_ON(!PageCompound()).
The number of elements of a page won't change over time, but
GCC doesn't know about it, so it gets the value at every iteration.
To avoid that, call compound_nr() outside the loop and save it in
a local variable.
Whether the page is a HIGHMEM page or not doesn't change over time.
But GCC doesn't know it so it does the test on every iteration.
Do the test outside the loop.
When the page is not a HIGHMEM page, page_address() will fallback on
lowmem_page_address(), so call lowmem_page_address() directly and
don't suffer the call to page_address() on every iteration.
powerpc/mem: Call flush_coherent_icache() at higher level
flush_coherent_icache() doesn't need the address anymore,
so it can be called immediately when entering the public
functions and doesn't need to be disseminated among
lower level functions.
And use page_to_phys() instead of open coding the calculation
of phys address to call flush_dcache_icache_phys().
Bixuan Cui [Fri, 9 Apr 2021 09:01:24 +0000 (17:01 +0800)]
powerpc/perf/hv-24x7: Make some symbols static
The sparse tool complains as follows:
arch/powerpc/perf/hv-24x7.c:229:1: warning:
symbol '__pcpu_scope_hv_24x7_txn_flags' was not declared. Should it be static?
arch/powerpc/perf/hv-24x7.c:230:1: warning:
symbol '__pcpu_scope_hv_24x7_txn_err' was not declared. Should it be static?
arch/powerpc/perf/hv-24x7.c:236:1: warning:
symbol '__pcpu_scope_hv_24x7_hw' was not declared. Should it be static?
arch/powerpc/perf/hv-24x7.c:244:1: warning:
symbol '__pcpu_scope_hv_24x7_reqb' was not declared. Should it be static?
arch/powerpc/perf/hv-24x7.c:245:1: warning:
symbol '__pcpu_scope_hv_24x7_resb' was not declared. Should it be static?
This symbol is not used outside of hv-24x7.c, so this
commit marks it static.
Leonardo Bras [Thu, 8 Apr 2021 20:19:16 +0000 (17:19 -0300)]
powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR
According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
will let the OS know all possible pagesizes that can be used for creating a
new DDW.
Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
128M, 256M and 16G.
Enabling bigger pages would be interesting for direct mapping systems
with a lot of RAM, while using less TCE entries.
powerpc/rtas: rename RTAS_RMOBUF_MAX to RTAS_USER_REGION_SIZE
RTAS_RMOBUF_MAX doesn't actually describe a "maximum" value in any
sense. It represents the size of an area of memory set aside for user
space to use as work areas for certain RTAS calls.
powerpc/rtas: move syscall filter setup into separate function
Reduce conditionally compiled sections within rtas_initialize() by
moving the filter table initialization into its own function already
guarded by CONFIG_PPC_RTAS_FILTER. No behavior change intended.
There's not a compelling reason to cache the value of the token for
the ibm,suspend-me function. Just look it up when needed in the RTAS
syscall's special case for it.
powerpc/eeh: Fix EEH handling for hugepages in ioremap space.
During the EEH MMIO error checking, the current implementation fails to map
the (virtual) MMIO address back to the pci device on radix with hugepage
mappings for I/O. This results into failure to dispatch EEH event with no
recovery even when EEH capability has been enabled on the device.
In case of hugepage mappings, eeh_token_to_phys() has a bug in virt -> phys
translation that results in wrong physical address, which is then passed to
eeh_addr_cache_get_dev() to match it against cached pci I/O address ranges
to get to a PCI device. Hence, it fails to find a match and the EEH event
never gets dispatched leaving the device in failed state.
The commit 33439620680be ("powerpc/eeh: Handle hugepages in ioremap space")
introduced following logic to translate virt to phys for hugepage mappings:
eeh_token_to_phys():
+ pa = pte_pfn(*ptep);
+
+ /* On radix we can do hugepage mappings for io, so handle that */
+ if (hugepage_shift) {
+ pa <<= hugepage_shift; <= This is wrong
+ pa |= token & ((1ul << hugepage_shift) - 1);
+ }
This patch fixes the virt -> phys translation in eeh_token_to_phys()
function.
$ cat /sys/kernel/debug/powerpc/eeh_address_cache
mem addr range [0x0000040080000000-0x00000400807fffff]: 0030:01:00.1
mem addr range [0x0000040080800000-0x0000040080ffffff]: 0030:01:00.1
mem addr range [0x0000040081000000-0x00000400817fffff]: 0030:01:00.0
mem addr range [0x0000040081800000-0x0000040081ffffff]: 0030:01:00.0
mem addr range [0x0000040082000000-0x000004008207ffff]: 0030:01:00.1
mem addr range [0x0000040082080000-0x00000400820fffff]: 0030:01:00.0
mem addr range [0x0000040082100000-0x000004008210ffff]: 0030:01:00.1
mem addr range [0x0000040082110000-0x000004008211ffff]: 0030:01:00.0
Above is the list of cached io address ranges of pci 0030:01:00.<fn>.
Before this patch:
Tracing 'arg1' of function eeh_addr_cache_get_dev() during error injection
clearly shows that 'addr=' contains wrong physical address:
Cédric Le Goater [Wed, 31 Mar 2021 14:45:14 +0000 (16:45 +0200)]
powerpc/xive: Modernize XIVE-IPI domain with an 'alloc' handler
Instead of calling irq_create_mapping() to map the IPI for a node,
introduce an 'alloc' handler. This is usually an extension to support
hierarchy irq_domains which is not exactly the case for XIVE-IPI
domain. However, we can now use the irq_domain_alloc_irqs() routine
which allocates the IRQ descriptor on the specified node, even better
for cache performance on multi node machines.
Cédric Le Goater [Wed, 31 Mar 2021 14:45:13 +0000 (16:45 +0200)]
powerpc/xive: Map one IPI interrupt per node
ipistorm [*] can be used to benchmark the raw interrupt rate of an
interrupt controller by measuring the number of IPIs a system can
sustain. When applied to the XIVE interrupt controller of POWER9 and
POWER10 systems, a significant drop of the interrupt rate can be
observed when crossing the second node boundary.
This is due to the fact that a single IPI interrupt is used for all
CPUs of the system. The structure is shared and the cache line updates
impact greatly the traffic between nodes and the overall IPI
performance.
As a workaround, the impact can be reduced by deactivating the IRQ
lockup detector ("noirqdebug") which does a lot of accounting in the
Linux IRQ descriptor structure and is responsible for most of the
performance penalty.
As a fix, this proposal allocates an IPI interrupt per node, to be
shared by all CPUs of that node. It solves the scaling issue, the IRQ
lockup detector still has an impact but the XIVE interrupt rate scales
linearly. It also improves the "noirqdebug" case as showed in the
tables below.
Cédric Le Goater [Wed, 31 Mar 2021 14:45:12 +0000 (16:45 +0200)]
powerpc/xive: Fix xmon command "dxi"
When under xmon, the "dxi" command dumps the state of the XIVE
interrupts. If an interrupt number is specified, only the state of
the associated XIVE interrupt is dumped. This form of the command
lacks an irq_data parameter which is nevertheless used by
xmon_xive_get_irq_config(), leading to an xmon crash.
Fix that by doing a lookup in the system IRQ mapping to query the IRQ
descriptor data. Invalid interrupt numbers, or not belonging to the
XIVE IRQ domain, OPAL event interrupt number for instance, should be
caught by the previous query done at the firmware level.
Fixes: 97ef27507793 ("powerpc/xive: Fix xmon support on the PowerNV platform") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Tested-by: Greg Kurz <groug@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210331144514.892250-8-clg@kaod.org
Cédric Le Goater [Wed, 31 Mar 2021 14:45:09 +0000 (16:45 +0200)]
powerpc/xive: Simplify xive_core_debug_show()
Now that the IPI interrupt has its own domain, the checks on the HW
interrupt number XIVE_IPI_HW_IRQ and on the chip can be replaced by a
check on the domain.
Cédric Le Goater [Wed, 31 Mar 2021 14:45:07 +0000 (16:45 +0200)]
powerpc/xive: Introduce an IPI interrupt domain
The IPI interrupt is a special case of the XIVE IRQ domain. When
mapping and unmapping the interrupts in the Linux interrupt number
space, the HW interrupt number 0 (XIVE_IPI_HW_IRQ) is checked to
distinguish the IPI interrupt from other interrupts of the system.
Simplify the XIVE interrupt domain by introducing a specific domain
for the IPI.
arch/powerpc/kernel/smp.c:86:1: warning:
symbol '__pcpu_scope_cpu_coregroup_map' was not declared. Should it be static?
arch/powerpc/kernel/smp.c:125:1: warning:
symbol '__pcpu_scope_thread_group_l1_cache_map' was not declared. Should it be static?
arch/powerpc/kernel/smp.c:132:1: warning:
symbol '__pcpu_scope_thread_group_l2_cache_map' was not declared. Should it be static?
These symbols are not used outside of smp.c, so this
commit marks them static.
drivers/macintosh/via-pmu.c:183:5: warning:
symbol 'pmu_cur_battery' was not declared. Should it be static?
drivers/macintosh/via-pmu.c:190:5: warning:
symbol '__fake_sleep' was not declared. Should it be static?
These symbols are not used outside of via-pmu.c, so this
commit marks them static.
powerpc/32s: Define a MODULE area below kernel text all the time
On book3s/32, the segment below kernel text is used for module
allocation when CONFIG_STRICT_KERNEL_RWX is defined.
In order to benefit from the powerpc specific module_alloc()
function which allocate modules with 32 Mbytes from
end of kernel text, use that segment below PAGE_OFFSET at all time.
powerpc/8xx: Define a MODULE area below kernel text
On the 8xx, TASK_SIZE is 0x80000000. The space between TASK_SIZE
and PAGE_OFFSET is not used.
In order to benefit from the powerpc specific module_alloc()
function which allocate modules with 32 Mbytes from
end of kernel text, define MODULES_VADDR and MODULES_END.
Set a 256Mb area just below PAGE_OFFSET, like book3s/32.
powerpc/modules: Load modules closer to kernel text
On book3s/32, when STRICT_KERNEL_RWX is selected, modules are
allocated on the segment just before kernel text, ie on the
0xb0000000-0xbfffffff when PAGE_OFFSET is 0xc0000000.
On the 8xx, TASK_SIZE is 0x80000000. The space between TASK_SIZE and
PAGE_OFFSET is not used and could be used for modules.
The idea comes from ARM architecture.
Having modules just below PAGE_OFFSET offers an opportunity to
minimise the distance between kernel text and modules and avoid
trampolines in modules to access kernel functions or other module
functions.
When MODULES_VADDR is defined, powerpc has it's own module_alloc()
function. In that function, first try to allocate the module
above the limit defined by '_etext - 32M'. Then if the allocation
fails, fallback to the entire MODULES area.
DEBUG logs in module_32.c without the patch:
[ 1572.588822] module_32: Applying ADD relocate section 13 to 12
[ 1572.588891] module_32: Doing plt for call to 0xc00671a4 at 0xcae04024
[ 1572.588964] module_32: Initialized plt for 0xc00671a4 at cae04000
[ 1572.589037] module_32: REL24 value = CAE04000. location = CAE04024
[ 1572.589110] module_32: Location before: 48000001.
[ 1572.589171] module_32: Location after: 4BFFFFDD.
[ 1572.589231] module_32: ie. jump to 03FFFFDC+CAE04024 = CEE04000
[ 1572.589317] module_32: Applying ADD relocate section 15 to 14
[ 1572.589386] module_32: Doing plt for call to 0xc00671a4 at 0xcadfc018
[ 1572.589457] module_32: Initialized plt for 0xc00671a4 at cadfc000
[ 1572.589529] module_32: REL24 value = CADFC000. location = CADFC018
[ 1572.589601] module_32: Location before: 48000000.
[ 1572.589661] module_32: Location after: 4BFFFFE8.
[ 1572.589723] module_32: ie. jump to 03FFFFE8+CADFC018 = CEDFC000