Catalin Marinas [Wed, 25 Mar 2020 11:10:32 +0000 (11:10 +0000)]
Merge branches 'for-next/memory-hotremove', 'for-next/arm_sdei', 'for-next/amu', 'for-next/final-cap-helper', 'for-next/cpu_ops-cleanup', 'for-next/misc' and 'for-next/perf' into for-next/core
* for-next/memory-hotremove:
: Memory hot-remove support for arm64
arm64/mm: Enable memory hot remove
arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
* for-next/arm_sdei:
: SDEI: fix double locking on return from hibernate and clean-up
firmware: arm_sdei: clean up sdei_event_create()
firmware: arm_sdei: Use cpus_read_lock() to avoid races with cpuhp
firmware: arm_sdei: fix possible double-lock on hibernate error path
firmware: arm_sdei: fix double-lock on hibernate with shared events
* for-next/amu:
: ARMv8.4 Activity Monitors support
clocksource/drivers/arm_arch_timer: validate arch_timer_rate
arm64: use activity monitors for frequency invariance
cpufreq: add function to get the hardware max frequency
Documentation: arm64: document support for the AMU extension
arm64/kvm: disable access to AMU registers from kvm guests
arm64: trap to EL1 accesses to AMU counters from EL0
arm64: add support for the AMU extension v1
* for-next/final-cap-helper:
: Introduce cpus_have_final_cap_helper(), migrate arm64 KVM to it
arm64: kvm: hyp: use cpus_have_final_cap()
arm64: cpufeature: add cpus_have_final_cap()
* for-next/cpu_ops-cleanup:
: cpu_ops[] access code clean-up
arm64: Introduce get_cpu_ops() helper function
arm64: Rename cpu_read_ops() to init_cpu_ops()
arm64: Declare ACPI parking protocol CPU operation if needed
* for-next/misc:
: Various fixes and clean-ups
arm64: define __alloc_zeroed_user_highpage
arm64/kernel: Simplify __cpu_up() by bailing out early
arm64: remove redundant blank for '=' operator
arm64: kexec_file: Fixed code style.
arm64: add blank after 'if'
arm64: fix spelling mistake "ca not" -> "cannot"
arm64: entry: unmask IRQ in el0_sp()
arm64: efi: add efi-entry.o to targets instead of extra-$(CONFIG_EFI)
arm64: csum: Optimise IPv6 header checksum
arch/arm64: fix typo in a comment
arm64: remove gratuitious/stray .ltorg stanzas
arm64: Update comment for ASID() macro
arm64: mm: convert cpu_do_switch_mm() to C
arm64: fix NUMA Kconfig typos
* for-next/perf:
: arm64 perf updates
arm64: perf: Add support for ARMv8.5-PMU 64-bit counters
KVM: arm64: limit PMU version to PMUv3 for ARMv8.1
arm64: cpufeature: Extract capped perfmon fields
arm64: perf: Clean up enable/disable calls
perf: arm-ccn: Use scnprintf() for robustness
arm64: perf: Support new DT compatibles
arm64: perf: Refactor PMU init callbacks
perf: arm_spe: Remove unnecessary zero check on 'nr_pages'
Gavin Shan [Wed, 18 Mar 2020 23:01:44 +0000 (10:01 +1100)]
arm64: Introduce get_cpu_ops() helper function
This introduces get_cpu_ops() to return the CPU operations according to
the given CPU index. For now, it simply returns the @cpu_ops[cpu] as
before. Also, helper function __cpu_try_die() is introduced to be shared
by cpu_die() and ipi_cpu_crash_stop(). So it shouldn't introduce any
functional changes.
Gavin Shan [Wed, 18 Mar 2020 23:01:43 +0000 (10:01 +1100)]
arm64: Rename cpu_read_ops() to init_cpu_ops()
This renames cpu_read_ops() to init_cpu_ops() as the function is only
called in initialization phase. Also, we will introduce get_cpu_ops() in
the subsequent patches, to retireve the CPU operation by the given CPU
index. The usage of cpu_read_ops() and get_cpu_ops() are difficult to be
distinguished from their names.
Gavin Shan [Wed, 18 Mar 2020 23:01:42 +0000 (10:01 +1100)]
arm64: Declare ACPI parking protocol CPU operation if needed
It's obvious we needn't declare the corresponding CPU operation when
CONFIG_ARM64_ACPI_PARKING_PROTOCOL is disabled, even it doesn't cause
any compiling warnings.
Andrew Murray [Mon, 2 Mar 2020 18:17:52 +0000 (18:17 +0000)]
arm64: perf: Add support for ARMv8.5-PMU 64-bit counters
At present ARMv8 event counters are limited to 32-bits, though by
using the CHAIN event it's possible to combine adjacent counters to
achieve 64-bits. The perf config1:0 bit can be set to use such a
configuration.
With the introduction of ARMv8.5-PMU support, all event counters can
now be used as 64-bit counters.
Let's enable 64-bit event counters where support exists. Unless the
user sets config1:0 we will adjust the counter value such that it
overflows upon 32-bit overflow. This follows the same behaviour as
the cycle counter which has always been (and remains) 64-bits.
Signed-off-by: Andrew Murray <andrew.murray@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
[Mark: fix ID field names, compare with 8.5 value] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Andrew Murray [Mon, 2 Mar 2020 18:17:51 +0000 (18:17 +0000)]
KVM: arm64: limit PMU version to PMUv3 for ARMv8.1
We currently expose the PMU version of the host to the guest via
emulation of the DFR0_EL1 and AA64DFR0_EL1 debug feature registers.
However many of the features offered beyond PMUv3 for 8.1 are not
supported in KVM. Examples of this include support for the PMMIR
registers (added in PMUv3 for ARMv8.4) and 64-bit event counters
added in (PMUv3 for ARMv8.5).
Let's trap the Debug Feature Registers in order to limit
PMUVer/PerfMon in the Debug Feature Registers to PMUv3 for ARMv8.1
to avoid unexpected behaviour.
Both ID_AA64DFR0.PMUVer and ID_DFR0.PerfMon follow the "Alternative ID
scheme used for the Performance Monitors Extension version" where 0xF
means an IMPLEMENTATION DEFINED PMU is implemented, and values 0x0-0xE
are treated as with an unsigned field (with 0x0 meaning no PMU is
present). As we don't expect to expose an IMPLEMENTATION DEFINED PMU,
and our cap is below 0xF, we can treat these fields as unsigned when
applying the cap.
Signed-off-by: Andrew Murray <andrew.murray@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
[Mark: make field names consistent, use perfmon cap] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Andrew Murray [Mon, 2 Mar 2020 18:17:50 +0000 (18:17 +0000)]
arm64: cpufeature: Extract capped perfmon fields
When emulating ID registers there is often a need to cap the version
bits of a feature such that the guest will not use features that the
host is not aware of. For example, when KVM mediates access to the PMU
by emulating register accesses.
Let's add a helper that extracts a performance monitors ID field and
caps the version to a given value.
Fields that identify the version of the Performance Monitors Extension
do not follow the standard ID scheme, and instead follow the scheme
described in ARM DDI 0487E.a page D13-2825 "Alternative ID scheme used
for the Performance Monitors Extension version". The value 0xF means an
IMPLEMENTATION DEFINED PMU is present, and values 0x0-OxE can be treated
the same as an unsigned field with 0x0 meaning no PMU is present.
Signed-off-by: Andrew Murray <andrew.murray@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
[Mark: rework to handle perfmon fields] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Robin Murphy [Tue, 17 Mar 2020 18:22:54 +0000 (18:22 +0000)]
arm64: perf: Clean up enable/disable calls
Reading this code bordered on painful, what with all the repetition and
pointless return values. More fundamentally, dribbling the hardware
enables and disables in one bit at a time incurs needless system
register overhead for chained events and on reset. We already use
bitmask values for the KVM hooks, so consolidate all the register
accesses to match, and make a reasonable saving in both source and
object code.
Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Takashi Iwai [Sun, 15 Mar 2020 09:37:15 +0000 (10:37 +0100)]
perf: arm-ccn: Use scnprintf() for robustness
snprintf() is a hard-to-use function, it's especially difficult to use
it for concatenating substrings in a buffer with a limited size.
Since snprintf() returns the would-be-output size, not the actual
size, the subsequent use of snprintf() may point to the incorrect
position easily. Although the current code doesn't actually overflow
the buffer, it's an incorrect usage.
This patch replaces such snprintf() calls with a safer version,
scnprintf().
Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Will Deacon <will@kernel.org>
glider@google.com [Thu, 12 Mar 2020 15:59:20 +0000 (16:59 +0100)]
arm64: define __alloc_zeroed_user_highpage
When running the kernel with init_on_alloc=1, calling the default
implementation of __alloc_zeroed_user_highpage() from include/linux/highmem.h
leads to double-initialization of the allocated page (first by the page
allocator, then by clear_user_page().
Calling alloc_page_vma() with __GFP_ZERO, similarly to e.g. x86, seems
to be enough to ensure the user page is zeroed only once.
Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Gavin Shan [Mon, 2 Mar 2020 02:03:40 +0000 (13:03 +1100)]
arm64/kernel: Simplify __cpu_up() by bailing out early
The function __cpu_up() is invoked to bring up the target CPU through
the backend, PSCI for example. The nested if statements won't be needed
if we bail out early on the following two conditions where the status
won't be checked. The code looks simplified in that case.
* Error returned from the backend (e.g. PSCI)
* The target CPU has been marked as onlined
Mark Rutland [Fri, 21 Feb 2020 14:50:22 +0000 (14:50 +0000)]
arm64: kvm: hyp: use cpus_have_final_cap()
The KVM hyp code is only run after system capabilities have been
finalized, and thus all const cap checks have been patched. This is
noted in in __cpu_init_hyp_mode(), where we BUG() if called too early:
| /*
| * Call initialization code, and switch to the full blown HYP code.
| * If the cpucaps haven't been finalized yet, something has gone very
| * wrong, and hyp will crash and burn when it uses any
| * cpus_have_const_cap() wrapper.
| */
Given this, the hyp code can use cpus_have_final_cap() and avoid
generating code to check the cpu_hwcaps array, which would be unsafe to
run in hyp context.
This patch migrate the KVM hyp code to cpus_have_final_cap(), avoiding
this redundant code generation, and making it possible to detect if we
accidentally invoke this code too early. In the latter case, the BUG()
in cpus_have_final_cap() will cause a hyp panic.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Julien Thierry <julien.thierry.kdev@gmail.com> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Fri, 21 Feb 2020 14:50:21 +0000 (14:50 +0000)]
arm64: cpufeature: add cpus_have_final_cap()
When cpus_have_const_cap() was originally introduced it was intended to
be safe in hyp context, where it is not safe to access the cpu_hwcaps
array as cpus_have_cap() did. For more details see commit:
a4023f682739439b ("arm64: Add hypervisor safe helper for checking constant capabilities")
We then made use of cpus_have_const_cap() throughout the kernel.
Subsequently, we had to defer updating the static_key associated with
each capability in order to avoid lockdep complaints. To avoid breaking
kernel-wide usage of cpus_have_const_cap(), this was updated to fall
back to the cpu_hwcaps array if called before the static_keys were
updated. As the kvm hyp code was only called later than this, the
fallback is redundant but not functionally harmful. For more details,
see commit:
63a1e1c95e60e798 ("arm64/cpufeature: don't use mutex in bringup path")
Today we have more users of cpus_have_const_cap() which are only called
once the relevant static keys are initialized, and it would be
beneficial to avoid the redundant code.
To that end, this patch adds a new cpus_have_final_cap(), helper which
is intend to be used in code which is only run once capabilities have
been finalized, and will never check the cpus_hwcap array. This helps
the compiler to generate better code as it no longer needs to generate
code to address and test the cpus_hwcap array. To help catch misuse,
cpus_have_final_cap() will BUG() if called before capabilities are
finalized.
In hyp context, BUG() will result in a hyp panic, but the specific BUG()
instance will not be identified in the usual way.
Comments are added to the various cpus_have_*_cap() helpers to describe
the constraints on when they can be used. For clarity cpus_have_cap() is
moved above the other helpers. Similarly the helpers are updated to use
system_capabilities_finalized() consistently, and this is made
__always_inline as required by its new callers.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Fri, 28 Feb 2020 14:59:42 +0000 (14:59 +0000)]
arm64: entry: unmask IRQ in el0_sp()
Currently, the EL0 SP alignment handler masks IRQs unnecessarily. It
does so due to historic code sharing of the EL0 SP and PC alignment
handlers, and branch predictor hardening applicable to the EL0 SP
handler.
We began masking IRQs in the EL0 SP alignment handler in commit:
5dfc6ed27710c42c ("arm64: entry: Apply BP hardening for high-priority synchronous exception")
... as this shared code with the EL0 PC alignment handler, and branch
predictor hardening made it necessary to disable IRQs for early parts of
the EL0 PC alignment handler. It was not necessary to mask IRQs during
EL0 SP alignment exceptions, but it was not considered harmful to do so.
This masking was carried forward into C code in commit:
Robin Murphy [Mon, 20 Jan 2020 18:52:29 +0000 (18:52 +0000)]
arm64: csum: Optimise IPv6 header checksum
Throwing our __uint128_t idioms at csum_ipv6_magic() makes it
about 1.3x-2x faster across a range of microarchitecture/compiler
combinations. Not much in absolute terms, but every little helps.
Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Using an arch timer with a frequency of less than 1MHz can potentially
result in incorrect functionality in systems that assume a reasonable
rate of the arch timer of 1 to 50MHz, described as typical in the
architecture specification.
Therefore, warn if the arch timer rate is below 1MHz, which is
considered atypical and worth emphasizing.
Suggested-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Ionela Voinescu [Thu, 5 Mar 2020 09:06:26 +0000 (09:06 +0000)]
arm64: use activity monitors for frequency invariance
The Frequency Invariance Engine (FIE) is providing a frequency
scaling correction factor that helps achieve more accurate
load-tracking.
So far, for arm and arm64 platforms, this scale factor has been
obtained based on the ratio between the current frequency and the
maximum supported frequency recorded by the cpufreq policy. The
setting of this scale factor is triggered from cpufreq drivers by
calling arch_set_freq_scale. The current frequency used in computation
is the frequency requested by a governor, but it may not be the
frequency that was implemented by the platform.
This correction factor can also be obtained using a core counter and a
constant counter to get information on the performance (frequency based
only) obtained in a period of time. This will more accurately reflect
the actual current frequency of the CPU, compared with the alternative
implementation that reflects the request of a performance level from
the OS.
Therefore, implement arch_scale_freq_tick to use activity monitors, if
present, for the computation of the frequency scale factor.
The use of AMU counters depends on:
- CONFIG_ARM64_AMU_EXTN - depents on the AMU extension being present
- CONFIG_CPU_FREQ - the current frequency obtained using counter
information is divided by the maximum frequency obtained from the
cpufreq policy.
While it is possible to have a combination of CPUs in the system with
and without support for activity monitors, the use of counters for
frequency invariance is only enabled for a CPU if all related CPUs
(CPUs in the same frequency domain) support and have enabled the core
and constant activity monitor counters. In this way, there is a clear
separation between the policies for which arch_set_freq_scale (cpufreq
based FIE) is used, and the policies for which arch_scale_freq_tick
(counter based FIE) is used to set the frequency scale factor. For
this purpose, a late_initcall_sync is registered to trigger validation
work for policies that will enable or disable the use of AMU counters
for frequency invariance. If CONFIG_CPU_FREQ is not defined, the use
of counters is enabled on all CPUs only if all possible CPUs correctly
support the necessary counters.
Ionela Voinescu [Thu, 5 Mar 2020 09:06:25 +0000 (09:06 +0000)]
cpufreq: add function to get the hardware max frequency
Add weak function to return the hardware maximum frequency of a CPU,
with the default implementation returning cpuinfo.max_freq, which is
the best information we can generically get from the cpufreq framework.
The default can be overwritten by a strong function in platforms
that want to provide an alternative implementation, with more accurate
information, obtained either from hardware or firmware.
Ionela Voinescu [Thu, 5 Mar 2020 09:06:23 +0000 (09:06 +0000)]
arm64/kvm: disable access to AMU registers from kvm guests
Access to the AMU counters should be disabled by default in kvm guests,
as information from the counters might reveal activity in other guests
or activity on the host.
Therefore, disable access to AMU registers from EL0 and EL1 in kvm
guests by:
- Hiding the presence of the extension in the feature register
(SYS_ID_AA64PFR0_EL1) on the VCPU.
- Disabling access to the AMU registers before switching to the guest.
- Trapping accesses and injecting an undefined instruction into the
guest.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Marc Zyngier <maz@kernel.org> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Julien Thierry <julien.thierry.kdev@gmail.com> Cc: James Morse <james.morse@arm.com> Cc: Marc Zyngier <maz@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Ionela Voinescu [Thu, 5 Mar 2020 09:06:22 +0000 (09:06 +0000)]
arm64: trap to EL1 accesses to AMU counters from EL0
The activity monitors extension is an optional extension introduced
by the ARMv8.4 CPU architecture. In order to access the activity
monitors counters safely, if desired, the kernel should detect the
presence of the extension through the feature register, and mediate
the access.
Therefore, disable direct accesses to activity monitors counters
from EL0 (userspace) and trap them to EL1 (kernel).
To be noted that the ARM64_AMU_EXTN kernel config does not have an
effect on this code. Given that the amuserenr_el0 resets to an
UNKNOWN value, setting the trap of EL0 accesses to EL1 is always
attempted for safety and security considerations. Therefore firmware
should still ensure accesses to AMU registers are not trapped in
EL2/EL3 as this code cannot be bypassed if the CPU implements the
Activity Monitors Unit.
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Ionela Voinescu [Thu, 5 Mar 2020 09:06:21 +0000 (09:06 +0000)]
arm64: add support for the AMU extension v1
The activity monitors extension is an optional extension introduced
by the ARMv8.4 CPU architecture. This implements basic support for
version 1 of the activity monitors architecture, AMUv1.
This support includes:
- Extension detection on each CPU (boot, secondary, hotplugged)
- Register interface for AMU aarch64 registers
Signed-off-by: Ionela Voinescu <ionela.voinescu@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Anshuman Khandual [Wed, 4 Mar 2020 04:28:43 +0000 (09:58 +0530)]
arm64/mm: Enable memory hot remove
The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.
This patch adds unmap_hotplug_range() and free_empty_tables() helpers which
can be used to tear down either region and calls it from vmemmap_free() and
___remove_pgd_mapping(). The free_mapped argument determines whether the
backing memory will be freed.
It makes two distinct passes over the kernel page table. In the first pass
with unmap_hotplug_range() it unmaps, invalidates applicable TLB cache and
frees backing memory if required (vmemmap) for each mapped leaf entry. In
the second pass with free_empty_tables() it looks for empty page table
sections whose page table page can be unmapped, TLB invalidated and freed.
While freeing intermediate level page table pages bail out if any of its
entries are still valid. This can happen for partially filled kernel page
table either from a previously attempted failed memory hot add or while
removing an address range which does not span the entire page table page
range.
The vmemmap region may share levels of table with the vmalloc region.
There can be conflicts between hot remove freeing page table pages with
a concurrent vmalloc() walking the kernel page table. This conflict can
not just be solved by taking the init_mm ptl because of existing locking
scheme in vmalloc(). So free_empty_tables() implements a floor and ceiling
method which is borrowed from user page table tear with free_pgd_range()
which skips freeing page table pages if intermediate address range is not
aligned or maximum floor-ceiling might not own the entire page table page.
Boot memory on arm64 cannot be removed. Hence this registers a new memory
hotplug notifier which prevents boot memory offlining and it's removal.
While here update arch_add_memory() to handle __add_pages() failures by
just unmapping recently added kernel linear mapping. Now enable memory hot
remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.
This implementation is overall inspired from kernel page table tear down
procedure on X86 architecture and user page table tear down method.
[Mike and Catalin added P4D page table level support]
Anshuman Khandual [Wed, 4 Mar 2020 04:28:42 +0000 (09:58 +0530)]
arm64/mm: Hold memory hotplug lock while walking for kernel page table dump
The arm64 page table dump code can race with concurrent modification of the
kernel page tables. When a leaf entries are modified concurrently, the dump
code may log stale or inconsistent information for a VA range, but this is
otherwise not harmful.
When intermediate levels of table are freed, the dump code will continue to
use memory which has been freed and potentially reallocated for another
purpose. In such cases, the dump code may dereference bogus addresses,
leading to a number of potential problems.
Intermediate levels of table may by freed during memory hot-remove,
which will be enabled by a subsequent patch. To avoid racing with
this, take the memory hotplug lock when walking the kernel page table.
Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Steven Price <steven.price@arm.com> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Robin Murphy [Fri, 21 Feb 2020 19:35:32 +0000 (19:35 +0000)]
arm64: perf: Support new DT compatibles
Add support for matching the new PMUs. For now, this just wires them up
as generic PMUv3 such that people writing DTs for new SoCs can do the
right thing, and at least have architectural and raw events be usable.
We can come back and fill in event maps for sysfs and/or perf tools at
a later date.
Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Robin Murphy [Fri, 21 Feb 2020 19:35:31 +0000 (19:35 +0000)]
arm64: perf: Refactor PMU init callbacks
The PMU init callbacks are already drowning in boilerplate, so before
doubling the number of supported PMU models, give it a sensible refactor
to significantly reduce the bloat, both in source and object code.
Although nobody uses non-default sysfs attributes today, there's minimal
impact to preserving the notion that maybe, some day, somebody might, so
we may as well keep up appearances.
Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
Will Deacon [Fri, 28 Feb 2020 12:43:55 +0000 (12:43 +0000)]
arm64: Update comment for ASID() macro
Commit 25b92693a1b6 ("arm64: mm: convert cpu_do_switch_mm() to C") added
a new use of the ASID() macro, so update the comment in asm/mmu.h which
reasons about why an atomic reload of 'mm->context.id.counter' is not
required.
Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Liguang Zhang [Fri, 21 Feb 2020 16:35:09 +0000 (16:35 +0000)]
firmware: arm_sdei: clean up sdei_event_create()
Function sdei_event_find() is always called in sdei_event_create(), but
it is already called in sdei_event_register(). This code is trying to
avoid a double-create of the same event, which can't happen as we still
hold the sdei_events_lock. We can remove this needless sdei_event_find()
call.
James Morse [Fri, 21 Feb 2020 16:35:08 +0000 (16:35 +0000)]
firmware: arm_sdei: Use cpus_read_lock() to avoid races with cpuhp
SDEI has private events that need registering and enabling on each CPU.
CPUs can come and go while we are trying to do this. SDEI tries to avoid
these problems by setting the reregister flag before the register call,
so any CPUs that come online register the event too. Sticking plaster
like this doesn't work, as if the register call fails, a CPU that
subsequently comes online will register the event before reregister
is cleared.
Take cpus_read_lock() around the register and enable calls. We don't
want surprise CPUs to do the wrong thing if they race with these calls
failing.
Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Liguang Zhang [Fri, 21 Feb 2020 16:35:07 +0000 (16:35 +0000)]
firmware: arm_sdei: fix possible double-lock on hibernate error path
We call sdei_reregister_event() with sdei_list_lock held, if the register
fails we call sdei_event_destroy() which also acquires sdei_list_lock
thus creating A-A deadlock.
Add '_llocked' to sdei_reregister_event(), to indicate the list lock
is held, and add a _llocked variant of sdei_event_destroy().
Fixes: da351827240e ("firmware: arm_sdei: Add support for CPU and system power states") Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
[expanded subject, added wrappers instead of duplicating contents] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
James Morse [Fri, 21 Feb 2020 16:35:06 +0000 (16:35 +0000)]
firmware: arm_sdei: fix double-lock on hibernate with shared events
SDEI has private events that must be registered on each CPU. When
CPUs come and go they must re-register and re-enable their private
events. Each event has flags to indicate whether this should happen
to protect against an event being registered on a CPU coming online,
while all the others are unregistering the event.
These flags are protected by the sdei_list_lock spinlock, because
the cpuhp callbacks can't take the mutex.
Hibernate needs to unregister all events, but keep the in-memory
re-register and re-enable as they are. sdei_unregister_shared()
takes the spinlock to walk the list, then calls _sdei_event_unregister()
on each shared event. _sdei_event_unregister() tries to take the
same spinlock to update re-register and re-enable. This doesn't go
so well.
Push the re-register and re-enable updates out to their callers.
sdei_unregister_shared() doesn't want these values updated, so
doesn't need to do anything.
This also fixes shared events getting lost over hibernate as this
path made them look unregistered.
Fixes: da351827240e ("firmware: arm_sdei: Add support for CPU and system power states") Reported-by: Liguang Zhang <zhangliguang@linux.alibaba.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Mark Rutland [Thu, 13 Feb 2020 12:14:52 +0000 (12:14 +0000)]
arm64: mm: convert cpu_do_switch_mm() to C
There's no reason that cpu_do_switch_mm() needs to be written as an
assembly function, and having it as a C function would make it easier to
maintain.
This patch converts cpu_do_switch_mm() to C, removing code that this
change makes redundant (e.g. the mmid macro). Since the header comment
was stale and the prototype now implies all the necessary information,
this comment is removed. The 'pgd_phys' argument is made a phys_addr_t
to match the return type of virt_to_phys().
At the same time, post_ttbr_update_workaround() is updated to use
IS_ENABLED(), which allows the compiler to figure out it can elide calls
for !CONFIG_CAVIUM_ERRATUM_27456 builds.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org>
[catalin.marinas@arm.com: change comments from asm-style to C-style] Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Linus Torvalds [Sun, 23 Feb 2020 17:43:50 +0000 (09:43 -0800)]
Merge tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"These are fixes that were found during testing with help of error
injection, plus some other stable material.
There's a fixup to patch added to rc1 causing locking in wrong context
warnings, tests found one more deadlock scenario. The patches are
tagged for stable, two of them now in the queue but we'd like all
three released at the same time.
I'm not happy about fixes to fixes in such a fast succession during
rcs, but I hope we found all the fallouts of commit 28553fa992cb
('Btrfs: fix race between shrinking truncate and fiemap')"
* tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof
Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents
btrfs: fix bytes_may_use underflow in prealloc error condtition
btrfs: handle logged extent failure properly
btrfs: do not check delayed items are empty for single transaction cleanup
btrfs: reset fs_root to NULL on error in open_ctree
btrfs: destroy qgroup extent records on transaction abort
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix mount failure with quota configured as module
jbd2: fix ocfs2 corrupt when clearing block group bits
ext4: fix race between writepages and enabling EXT4_EXTENTS_FL
ext4: rename s_journal_flag_rwsem to s_writepages_rwsem
ext4: fix potential race between s_flex_groups online resizing and access
ext4: fix potential race between s_group_info online resizing and access
ext4: fix potential race between online resizing and write operations
ext4: add cond_resched() to __ext4_find_entry()
ext4: fix a data race in EXT4_I(inode)->i_disksize
Linus Torvalds [Sun, 23 Feb 2020 17:37:41 +0000 (09:37 -0800)]
Merge tag 'csky-for-linus-5.6-rc3' of git://github.com/c-sky/csky-linux
Pull csky updates from Guo Ren:
"Sorry, I missed 5.6-rc1 merge window, but in this pull request the
most are the fixes and the rests are between fixes and features. The
only outside modification is the MAINTAINERS file update with our
mailing list.
- cache flush implementation fixes
- ftrace modify panic fix
- CONFIG_SMP boot problem fix
- fix pt_regs saving for atomic.S
- fix fixaddr_init without highmem.
- fix stack protector support
- fix fake Tightly-Coupled Memory code compile and use
- fix some typos and coding convention"
* tag 'csky-for-linus-5.6-rc3' of git://github.com/c-sky/csky-linux: (23 commits)
csky: Replace <linux/clk-provider.h> by <linux/of_clk.h>
csky: Implement copy_thread_tls
csky: Add PCI support
csky: Minimize defconfig to support buildroot config.fragment
csky: Add setup_initrd check code
csky: Cleanup old Kconfig options
arch/csky: fix some Kconfig typos
csky: Fixup compile warning for three unimplemented syscalls
csky: Remove unused cache implementation
csky: Fixup ftrace modify panic
csky: Add flush_icache_mm to defer flush icache all
csky: Optimize abiv2 copy_to_user_page with VM_EXEC
csky: Enable defer flush_dcache_page for abiv2 cpus (807/810/860)
csky: Remove unnecessary flush_icache_* implementation
csky: Support icache flush without specific instructions
csky/Kconfig: Add Kconfig.platforms to support some drivers
csky/smp: Fixup boot failed when CONFIG_SMP
csky: Set regs->usp to kernel sp, when the exception is from kernel
csky/mm: Fixup export invalid_pte_table symbol
csky: Separate fixaddr_init from highmem
...
Linus Torvalds [Sun, 23 Feb 2020 02:02:10 +0000 (18:02 -0800)]
Merge tag 'ras-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixes from Thomas Gleixner:
"Two fixes for the AMD MCE driver:
- Populate the per CPU MCA bank descriptor pointer only after it has
been completely set up to prevent a use-after-free in case that one
of the subsequent initialization step fails
- Implement a proper release function for the sysfs entries of MCA
threshold controls instead of freeing the memory right in the CPU
teardown code, which leads to another use-after-free when the
associated sysfs file is opened and accessed"
* tag 'ras-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/amd: Fix kobject lifetime
x86/mce/amd: Publish the bank pointer only after setup has succeeded
Linus Torvalds [Sun, 23 Feb 2020 01:25:46 +0000 (17:25 -0800)]
Merge tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"Two fixes for the irq core code which are follow ups to the recent MSI
fixes:
- The WARN_ON which was put into the MSI setaffinity callback for
paranoia reasons actually triggered via a callchain which escaped
when all the possible ways to reach that code were analyzed.
The proc/irq/$N/*affinity interfaces have a quirk which came in
when ALPHA moved to the generic interface: In case that the written
affinity mask does not contain any online CPU it calls into ALPHAs
magic auto affinity setting code.
A few years later this mechanism was also made available to x86 for
no good reasons and in a way which circumvents all sanity checks
for interrupts which cannot have their affinity set from process
context on X86 due to the way the X86 interrupt delivery works.
It would be possible to make this work properly, but there is no
point in doing so. If the interrupt is not yet started then the
affinity setting has no effect and if it is started already then it
is already assigned to an online CPU so there is no point to
randomly move it to some other CPU. Just return EINVAL as the code
has done before that change forever.
- The new MSI quirk bit in the irq domain flags turned out to be
already occupied, which escaped the author and the reviewers
because the already in use bits were 0,6,2,3,4,5 listed in that
order.
That bit 6 was simply overlooked because the ordering was straight
forward linear otherwise. So the new bit ended up being a
duplicate.
Fix it up by switching the oddball 6 to the obvious 1"
* tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/irqdomain: Make sure all irq domain flags are distinct
genirq/proc: Reject invalid affinity masks (again)
Linus Torvalds [Sun, 23 Feb 2020 01:08:16 +0000 (17:08 -0800)]
Merge tag 'x86-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two fixes for x86:
- Remove the __force_oder definiton from the kaslr boot code as it is
already defined in the page table code which makes GCC 10 builds
fail because it changed the default to -fno-common.
- Address the AMD erratum 1054 concerning the IRPERF capability and
enable the Instructions Retired fixed counter on machines which are
not affected by the erratum"
* tag 'x86-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Enable the fixed Instructions Retired counter IRPERF
x86/boot/compressed: Don't declare __force_order in kaslr_64.c
Linus Torvalds [Sat, 22 Feb 2020 19:12:55 +0000 (11:12 -0800)]
Merge tag 'io_uring-5.6-2020-02-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Here's a small collection of fixes that were queued up:
- Remove unnecessary NULL check (Dan)
- Missing io_req_cancelled() call in fallocate (Pavel)
- Put the cleanup check for aux data in the right spot (Pavel)
- Two fixes for SQPOLL (Stefano, Xiaoguang)"
* tag 'io_uring-5.6-2020-02-22' of git://git.kernel.dk/linux-block:
io_uring: fix __io_iopoll_check deadlock in io_sq_thread
io_uring: prevent sq_thread from spinning when it should stop
io_uring: fix use-after-free by io_cleanup_req()
io_uring: remove unnecessary NULL checks
io_uring: add missing io_req_cancelled()
Linus Torvalds [Sat, 22 Feb 2020 19:09:06 +0000 (11:09 -0800)]
Merge tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Just a set of NVMe fixes via Keith"
* tag 'block-5.6-2020-02-22' of git://git.kernel.dk/linux-block:
nvme-multipath: Fix memory leak with ana_log_buf
nvme: Fix uninitialized-variable warning
nvme-pci: Use single IRQ vector for old Apple models
nvme/pci: Add sleep quirk for Samsung and Toshiba drives
Linus Torvalds [Sat, 22 Feb 2020 19:00:52 +0000 (11:00 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Four non-core fixes.
Two are reverts of target fixes which turned out to have unwanted side
effects, one is a revert of an RDMA fix with the same problem and the
final one fixes an incorrect warning about memory allocation failures
in megaraid_sas (the driver actually reduces the allocation size until
it succeeds)"
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: Revert "target: iscsi: Wait for all commands to finish before freeing a session"
scsi: Revert "RDMA/isert: Fix a recently introduced regression related to logout"
scsi: megaraid_sas: silence a warning
scsi: Revert "target/core: Inline transport_lun_remove_cmd()"
Linus Torvalds [Sat, 22 Feb 2020 18:43:41 +0000 (10:43 -0800)]
Merge tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Remove ieee_emulation_warnings sysctl which is a dead code.
- Avoid triggering rebuild of the kernel during make install.
- Enable protected virtualization guest support in default configs.
- Fix cio_ignore seq_file .next function to increase position index.
And use kobj_to_dev instead of container_of in cio code.
- Fix storage block address lists to contain absolute addresses in qdio
code.
- Few clang warnings and spelling fixes.
* tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/qdio: fill SBALEs with absolute addresses
s390/qdio: fill SL with absolute addresses
s390: remove obsolete ieee_emulation_warnings
s390: make 'install' not depend on vmlinux
s390/kaslr: Fix casts in get_random
s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range
s390/pkey/zcrypt: spelling s/crytp/crypt/
s390/cio: use kobj_to_dev() API
s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST
s390/cio: cio_ignore_proc_seq_next should increase position index
Xiaoguang Wang [Sat, 22 Feb 2020 06:46:05 +0000 (14:46 +0800)]
io_uring: fix __io_iopoll_check deadlock in io_sq_thread
Since commit a3a0e43fd770 ("io_uring: don't enter poll loop if we have
CQEs pending"), if we already events pending, we won't enter poll loop.
In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has
been terminated and don't reap pending events which are already in cq
ring, and there are some reqs in poll_list, io_sq_thread will enter
__io_iopoll_check(), and find pending events, then return, this loop
will never have a chance to exit.
I have seen this issue in fio stress tests, to fix this issue, let
io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
and remove __io_iopoll_check().
Fixes: a3a0e43fd770 ("io_uring: don't enter poll loop if we have CQEs pending") Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Fri, 21 Feb 2020 10:08:35 +0000 (11:08 +0100)]
ext4: fix mount failure with quota configured as module
When CONFIG_QFMT_V2 is configured as a module, the test in
ext4_feature_set_ok() fails and so mount of filesystems with quota or
project features fails. Fix the test to use IS_ENABLED macro which
works properly even for modules.
Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz Fixes: d65d87a07476 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
wangyan [Thu, 20 Feb 2020 13:46:14 +0000 (21:46 +0800)]
jbd2: fix ocfs2 corrupt when clearing block group bits
I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
The running environment:
kernel version: 4.19
A cluster with two nodes, 5 luns mounted on two nodes, and do some
file operations like dd/fallocate/truncate/rm on every lun with storage
network disconnection.
The fallocate operation on dm-23-45 caused an null pointer dereference.
My analysis process as follows:
ocfs2_fallocate
__ocfs2_change_file_space
ocfs2_allocate_extents
ocfs2_extend_allocation
ocfs2_add_inode_data
ocfs2_add_clusters_in_btree
ocfs2_insert_extent
ocfs2_do_insert_extent
ocfs2_rotate_tree_right
ocfs2_extend_rotate_transaction
ocfs2_extend_trans
jbd2_journal_restart
jbd2__journal_restart
/* handle->h_transaction is NULL,
* is_handle_aborted(handle) is true
*/
handle->h_transaction = NULL;
start_this_handle
return -EROFS;
ocfs2_free_clusters
_ocfs2_free_clusters
_ocfs2_free_suballoc_bits
ocfs2_block_group_clear_bits
ocfs2_journal_access_gd
__ocfs2_journal_access
jbd2_journal_get_undo_access
/* I think jbd2_write_access_granted() will
* return true, because do_get_write_access()
* will return -EROFS.
*/
if (jbd2_write_access_granted(...)) return 0;
do_get_write_access
/* handle->h_transaction is NULL, it will
* return -EROFS here, so do_get_write_access()
* was not called.
*/
if (is_handle_aborted(handle)) return -EROFS;
/* bh2jh(group_bh) is NULL, caused NULL
pointer dereference */
undo_bg = (struct ocfs2_group_desc *)
bh2jh(group_bh)->b_committed_data;
If handle->h_transaction == NULL, then jbd2_write_access_granted()
does not really guarantee that journal_head will stay around,
not even speaking of its b_committed_data. The bh2jh(group_bh)
can be removed after ocfs2_journal_access_gd() and before call
"bh2jh(group_bh)->b_committed_data". So, we should move
is_handle_aborted() check from do_get_write_access() into
jbd2_journal_get_undo_access() and jbd2_journal_get_write_access()
before the call to jbd2_write_access_granted().
Eric Biggers [Wed, 19 Feb 2020 18:30:47 +0000 (10:30 -0800)]
ext4: fix race between writepages and enabling EXT4_EXTENTS_FL
If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
on it, the following warning in ext4_add_complete_io() can be hit:
WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120
Here's a minimal reproducer (not 100% reliable) (root isn't required):
while true; do
sync
done &
while true; do
rm -f file
touch file
chattr -e file
echo X >> file
chattr +e file
done
The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
(which only returns true on extent-based files) is checked once to set
the number of reserved journal credits, and also again later to select
the flags for ext4_map_blocks() and copy the reserved journal handle to
ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set,
the first check can see dioread_nolock disabled while the later one can
see it enabled, causing the reserved handle to unexpectedly be NULL.
Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
related to doing so as well, fix this by synchronizing changing
EXT4_EXTENTS_FL with ext4_writepages() via the existing
s_writepages_rwsem (previously called s_journal_flag_rwsem).
This was originally reported by syzbot without a reproducer at
https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
but now that dioread_nolock is the default I also started seeing this
when running syzkaller locally.
Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
Eric Biggers [Wed, 19 Feb 2020 18:30:46 +0000 (10:30 -0800)]
ext4: rename s_journal_flag_rwsem to s_writepages_rwsem
In preparation for making s_journal_flag_rwsem synchronize
ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
flags (rather than just JOURNAL_DATA as it does currently), rename it to
s_writepages_rwsem.
ext4: fix potential race between s_flex_groups online resizing and access
During an online resize an array of s_flex_groups structures gets replaced
so it can get enlarged. If there is a concurrent access to the array and
this memory has been reused then this can lead to an invalid memory access.
The s_flex_group array has been converted into an array of pointers rather
than an array of structures. This is to ensure that the information
contained in the structures cannot get out of sync during a resize due to
an accessor updating the value in the old structure after it has been
copied but before the array pointer is updated. Since the structures them-
selves are no longer copied but only the pointers to them this case is
mitigated.
Linus Torvalds [Sat, 22 Feb 2020 00:10:10 +0000 (16:10 -0800)]
Merge tag 'for-linus-5.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Two small fixes for Xen:
- a fix to avoid warnings with new gcc
- a fix for incorrectly disabled interrupts when calling
_cond_resched()"
* tag 'for-linus-5.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: Enable interrupts when calling _cond_resched()
x86/xen: Distribute switch variables for initialization
Linus Torvalds [Sat, 22 Feb 2020 00:03:36 +0000 (16:03 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"It's all straightforward apart from the changes to mmap()/mremap() in
relation to their handling of address arguments from userspace with
non-zero tag bits in the upper byte.
The change to brk() is necessary to fix a nasty user-visible
regression in malloc(), but we tightened up mmap() and mremap() at the
same time because they also allow the user to create virtual aliases
by accident. It's much less likely than brk() to matter in practice,
but enforcing the principle of "don't permit the creation of mappings
using tagged addresses" leads to a straightforward ABI without having
to worry about the "but what if a crazy program did foo?" aspect of
things.
Summary:
- Fix regression in malloc() caused by ignored address tags in brk()
- Add missing brackets around argument to untagged_addr() macro
- Fix clang build when using binutils assembler
- Fix silly typo in virtual memory map documentation"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
mm: Avoid creating virtual address aliases in brk()/mmap()/mremap()
docs: arm64: fix trivial spelling enought to enough in memory.rst
arm64: memory: Add missing brackets to untagged_addr() macro
arm64: lse: Fix LSE atomics with LLVM
Linus Torvalds [Fri, 21 Feb 2020 23:57:56 +0000 (15:57 -0800)]
Merge tag 'powerpc-5.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Some more powerpc fixes for 5.6. This is two weeks worth as I was out
sick last week:
- Three fixes for the recently added VMAP_STACK on 32-bit.
- Three fixes related to hugepages on 8xx (32-bit).
- A fix for a bug in our transactional memory handling that could
lead to a kernel crash if we saw a page fault during signal
delivery.
- A fix for a deadlock in our PCI EEH (Enhanced Error Handling) code.
- A couple of other minor fixes.
Thanks to: Christophe Leroy, Erhard F, Frederic Barrat, Gustavo Luiz
Duarte, Larry Finger, Leonardo Bras, Oliver O'Halloran, Sam Bobroff"
* tag 'powerpc-5.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/entry: Fix an #if which should be an #ifdef in entry_32.S
powerpc/xmon: Fix whitespace handling in getstring()
powerpc/6xx: Fix power_save_ppc32_restore() with CONFIG_VMAP_STACK
powerpc/chrp: Fix enter_rtas() with CONFIG_VMAP_STACK
powerpc/32s: Fix DSI and ISI exceptions for CONFIG_VMAP_STACK
powerpc/tm: Fix clearing MSR[TS] in current when reclaiming on signal delivery
powerpc/8xx: Fix clearing of bits 20-23 in ITLB miss
powerpc/hugetlb: Fix 8M hugepages on 8xx
powerpc/hugetlb: Fix 512k hugepages on 8xx with 16k page size
powerpc/eeh: Fix deadlock handling dead PHB
Linus Torvalds [Fri, 21 Feb 2020 21:02:49 +0000 (13:02 -0800)]
Merge tag 'linux-watchdog-5.6-rc3' of git://www.linux-watchdog.org/linux-watchdog
Pull watchdog fixes from Wim Van Sebroeck:
- mtk_wdt needs RESET_CONTROLLER to build
- da9062 driver fixes:
- fix power management ops
- do not ping the hw during stop()
- add dependency on I2C
* tag 'linux-watchdog-5.6-rc3' of git://www.linux-watchdog.org/linux-watchdog:
watchdog: da9062: Add dependency on I2C
watchdog: da9062: fix power management ops
watchdog: da9062: do not ping the hw during stop()
watchdog: fix mtk_wdt.c RESET_CONTROLLER build error
Linus Torvalds [Fri, 21 Feb 2020 20:57:05 +0000 (12:57 -0800)]
Merge tag 'char-misc-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small char/misc driver fixes for 5.6-rc3.
Also included in here are some updates for some documentation files
that I seem to be maintaining these days.
The driver fixes are:
- small fixes for the habanalabs driver
- fsi driver bugfix
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
Documentation/process: Swap out the ambassador for Canonical
habanalabs: patched cb equals user cb in device memset
habanalabs: do not halt CoreSight during hard reset
habanalabs: halt the engines before hard-reset
MAINTAINERS: remove unnecessary ':' characters
fsi: aspeed: add unspecified HAS_IOMEM dependency
COPYING: state that all contributions really are covered by this file
Documentation/process: Change Microsoft contact for embargoed hardware issues
embargoed-hardware-issues: drop Amazon contact as the email address now bounces
Documentation/process: Add Arm contact for embargoed HW issues
Linus Torvalds [Fri, 21 Feb 2020 20:53:53 +0000 (12:53 -0800)]
Merge tag 'staging-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are some small staging driver fixes for 5.6-rc3, along with the
removal of an unused/unneeded driver as well.
The android vsoc driver is not needed anymore by anyone, so it was
removed.
The other driver fixes are:
- ashmem bugfixes
- greybus audio driver bugfix
- wireless driver bugfixes and tiny cleanups to error paths
All of these have been in linux-next for a while now with no reported
issues"
* tag 'staging-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: rtl8723bs: Remove unneeded goto statements
staging: rtl8188eu: Remove some unneeded goto statements
staging: rtl8723bs: Fix potential overuse of kernel memory
staging: rtl8188eu: Fix potential overuse of kernel memory
staging: rtl8723bs: Fix potential security hole
staging: rtl8188eu: Fix potential security hole
staging: greybus: use after free in gb_audio_manager_remove_all()
staging: android: Delete the 'vsoc' driver
staging: rtl8723bs: fix copy of overlapping memory
staging: android: ashmem: Disallow ashmem memory from being remapped
staging: vt6656: fix sign of rx_dbm to bb_pre_ed_rssi.
Linus Torvalds [Fri, 21 Feb 2020 20:48:29 +0000 (12:48 -0800)]
Merge tag 'tty-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are a number of small tty and serial driver fixes for 5.6-rc3
that resolve a bunch of reported issues.
They are:
- vt selection and ioctl fixes
- serdev bugfix
- atmel serial driver fixes
- qcom serial driver fixes
- other minor serial driver fixes
All of these have been in linux-next for a while with no reported
issues"
* tag 'tty-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
vt: selection, close sel_buffer race
vt: selection, handle pending signals in paste_selection
serial: cpm_uart: call cpm_muram_init before registering console
tty: serial: qcom_geni_serial: Fix RX cancel command failure
serial: 8250: Check UPF_IRQ_SHARED in advance
tty: serial: imx: setup the correct sg entry for tx dma
vt: vt_ioctl: fix race in VT_RESIZEX
vt: fix scrollback flushing on background consoles
tty: serial: tegra: Handle RX transfer in PIO mode if DMA wasn't started
tty/serial: atmel: manage shutdown in case of RS485 or ISO7816 mode
serdev: ttyport: restore client ops on deregistration
serial: ar933x_uart: set UART_CS_{RX,TX}_READY_ORIDE
Linus Torvalds [Fri, 21 Feb 2020 20:44:53 +0000 (12:44 -0800)]
Merge tag 'usb-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB/Thunderbolt fixes from Greg KH:
"Here are a number of small USB driver fixes for 5.6-rc3.
Included in here are:
- MAINTAINER file updates
- USB gadget driver fixes
- usb core quirk additions and fixes for regressions
- xhci driver fixes
- usb serial driver id additions and fixes
- thunderbolt bugfix
Thunderbolt patches come in through here now that USB4 is really
thunderbolt.
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-5.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (34 commits)
USB: misc: iowarrior: add support for the 100 device
thunderbolt: Prevent crash if non-active NVMem file is read
usb: gadget: udc-xilinx: Fix xudc_stop() kernel-doc format
USB: misc: iowarrior: add support for the 28 and 28L devices
USB: misc: iowarrior: add support for 2 OEMed devices
USB: Fix novation SourceControl XL after suspend
xhci: Fix memory leak when caching protocol extended capability PSI tables - take 2
Revert "xhci: Fix memory leak when caching protocol extended capability PSI tables"
MAINTAINERS: Sort entries in database for THUNDERBOLT
usb: dwc3: debug: fix string position formatting mixup with ret and len
usb: gadget: serial: fix Tx stall after buffer overflow
usb: gadget: ffs: ffs_aio_cancel(): Save/restore IRQ flags
usb: dwc2: Fix SET/CLEAR_FEATURE and GET_STATUS flows
usb: dwc2: Fix in ISOC request length checking
usb: gadget: composite: Support more than 500mA MaxPower
usb: gadget: composite: Fix bMaxPower for SuperSpeedPlus
usb: gadget: u_audio: Fix high-speed max packet size
usb: dwc3: gadget: Check for IOC/LST bit in TRB->ctrl fields
USB: core: clean up endpoint-descriptor parsing
USB: quirks: blacklist duplicate ep on Sound Devices USBPre2
...
Linus Torvalds [Fri, 21 Feb 2020 20:18:02 +0000 (12:18 -0800)]
Merge tag 'drm-fixes-2020-02-21' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Varied fixes for rc3.
i915 is the largest, they are seeing some ACPI problems with their CI
which hopefully get solved soon [1].
msm has a bunch of fixes for new hw added in the merge, a bunch of
amdgpu fixes, and nouveau adds support for some new firmwares for
turing tu11x GPUs that were just released into linux-firmware by
nvidia, they operate the same as the ones we already have for tu10x so
should be fine to hook up.
Otherwise it's just misc fixes for panfrost and sun4i.
core:
- Allow only one rotation argument, and allow zero rotation in video
cmdline.
i915:
- Workaround missing Display Stream Compression (DSC) state readout
by forcing modeset when its enabled at probe
- Fix EHL port clock voltage level requirements
- Fix queuing retire workers on the virtual engine
- Fix use of partially initialized waiters
- Stop using drm_pci_alloc/drm_pci/free
- Fix rewind of RING_TAIL by forcing a context reload
- Fix locking on resetting ring->head
- Propagate our bug filing URL change to stable kernels
panfrost:
- Small compiler warning fix for panfrost.
- Fix when using performance counters in panfrost when using per fd
address space.
sun4xi:
- Fix dt binding
nouveau:
- tu11x modesetting fix
- ACR/GR firmware support for tu11x (fw is public now)
msm:
- fix UBWC on GPU and display side for sc7180
- fix DSI suspend/resume issue encountered on sc7180
- fix some breakage on so called "linux-android" devices
(fallout from sc7180/a618 support, not seen earlier due to
bootloader/firmware differences)
- couple other misc fixes
[1] The Intel suspend testing should now be fixed by commit 63fb9623427f
("ACPI: PM: s2idle: Check fixed wakeup events in acpi_s2idle_wake()")
* tag 'drm-fixes-2020-02-21' of git://anongit.freedesktop.org/drm/drm: (39 commits)
drm/amdgpu/display: clean up hdcp workqueue handling
drm/amdgpu: add is_raven_kicker judgement for raven1
drm/i915/gt: Avoid resetting ring->head outside of its timeline mutex
drm/i915/execlists: Always force a context reload when rewinding RING_TAIL
drm/i915: Wean off drm_pci_alloc/drm_pci_free
drm/i915/gt: Protect defer_request() from new waiters
drm/i915/gt: Prevent queuing retire workers on the virtual engine
drm/i915/dsc: force full modeset whenever DSC is enabled at probe
drm/i915/ehl: Update port clock voltage level requirements
drm/i915: Update drm/i915 bug filing URL
MAINTAINERS: Update drm/i915 bug filing URL
drm/i915: Initialise basic fence before acquiring seqno
drm/i915/gem: Require per-engine reset support for non-persistent contexts
drm/nouveau/kms/gv100-: Re-set LUT after clearing for modesets
drm/nouveau/gr/tu11x: initial support
drm/nouveau/acr/tu11x: initial support
drm/amdgpu/gfx10: disable gfxoff when reading rlc clock
drm/amdgpu/gfx9: disable gfxoff when reading rlc clock
drm/amdgpu/soc15: fix xclk for raven
drm/amd/powerplay: always refetch the enabled features status on dpm enablement
...
1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
Cong Wang.
2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
is finished, from Magnus Karlsson.
3) Set table field properly to RT_TABLE_COMPAT when necessary, from
Jethro Beekman.
4) NLA_STRING attributes are not necessary NULL terminated, deal wiht
that in IFLA_ALT_IFNAME. From Eric Dumazet.
5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.
6) Handle mtu==0 devices properly in wireguard, from Jason A.
Donenfeld.
7) Fix several lockdep warnings in bonding, from Taehee Yoo.
8) Fix cls_flower port blocking, from Jason Baron.
9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.
10) Fix RDMA race in qede driver, from Michal Kalderon.
11) Fix several false lockdep warnings by adding conditions to
list_for_each_entry_rcu(), from Madhuparna Bhowmik.
12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.
13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.
14) Hey, variables declared in switch statement before any case
statements are not initialized. I learn something every day. Get
rids of this stuff in several parts of the networking, from Kees
Cook.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
bnxt_en: Improve device shutdown method.
net: netlink: cap max groups which will be considered in netlink_bind()
net: thunderx: workaround BGX TX Underflow issue
ionic: fix fw_status read
net: disable BRIDGE_NETFILTER by default
net: macb: Properly handle phylink on at91rm9200
s390/qeth: fix off-by-one in RX copybreak check
s390/qeth: don't warn for napi with 0 budget
s390/qeth: vnicc Fix EOPNOTSUPP precedence
openvswitch: Distribute switch variables for initialization
net: ip6_gre: Distribute switch variables for initialization
net: core: Distribute switch variables for initialization
udp: rehash on disconnect
net/tls: Fix to avoid gettig invalid tls record
bpf: Fix a potential deadlock with bpf_map_do_batch
bpf: Do not grab the bucket spinlock by default on htab batch ops
ice: Wait for VF to be reset/ready before configuration
ice: Don't tell the OS that link is going down
ice: Don't reject odd values of usecs set by user
...
Linus Torvalds [Fri, 21 Feb 2020 19:40:10 +0000 (11:40 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
- A few y2038 fixes which missed the merge window while dependencies
in NFS were being sorted out.
- A bunch of fixes. Some minor, some not.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: use tabs for SAFESETID
lib/stackdepot.c: fix global out-of-bounds in stack_slabs
mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM
mm/vmscan.c: don't round up scan size for online memory cgroup
lib/string.c: update match_string() doc-strings with correct behavior
mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps()
mm/swapfile.c: fix a comment in sys_swapon()
scripts/get_maintainer.pl: deprioritize old Fixes: addresses
get_maintainer: remove uses of P: for maintainer name
selftests/vm: add missed tests in run_vmtests
include/uapi/linux/swab.h: fix userspace breakage, use __BITS_PER_LONG for swap
Revert "ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()"
y2038: hide timeval/timespec/itimerval/itimerspec types
y2038: remove unused time32 interfaces
y2038: remove ktime to/from timespec/timeval conversion
Alexander Potapenko [Fri, 21 Feb 2020 04:04:30 +0000 (20:04 -0800)]
lib/stackdepot.c: fix global out-of-bounds in stack_slabs
Walter Wu has reported a potential case in which init_stack_slab() is
called after stack_slabs[STACK_ALLOC_MAX_SLABS - 1] has already been
initialized. In that case init_stack_slab() will overwrite
stack_slabs[STACK_ALLOC_MAX_SLABS], which may result in a memory
corruption.
Link: http://lkml.kernel.org/r/20200218102950.260263-1-glider@google.com Fixes: cd11016e5f521 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Walter Wu <walter-zh.wu@mediatek.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On x86 the impact is limited to x86_32 builds, or x86_64 configurations
that override the default setting for SPARSEMEM_VMEMMAP.
Other memory hotplug archs (arm64, ia64, and ppc) also default to
SPARSEMEM_VMEMMAP=y.
[dan.j.williams@intel.com: changelog update]
{rppt@linux.ibm.com: changelog update] Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com Fixes: ba72b4c8cf60 ("mm/sparsemem: support sub-section hotplug") Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gavin Shan [Fri, 21 Feb 2020 04:04:24 +0000 (20:04 -0800)]
mm/vmscan.c: don't round up scan size for online memory cgroup
Commit 68600f623d69 ("mm: don't miss the last page because of round-off
error") makes the scan size round up to @denominator regardless of the
memory cgroup's state, online or offline. This affects the overall
reclaiming behavior: the corresponding LRU list is eligible for
reclaiming only when its size logically right shifted by @sc->priority
is bigger than zero in the former formula.
For example, the inactive anonymous LRU list should have at least 0x4000
pages to be eligible for reclaiming when we have 60/12 for
swappiness/priority and without taking scan/rotation ratio into account.
After the roundup is applied, the inactive anonymous LRU list becomes
eligible for reclaiming when its size is bigger than or equal to 0x1000
in the same condition.
aarch64 has 512MB huge page size when the base page size is 64KB. The
memory cgroup that has a huge page is always eligible for reclaiming in
that case.
The reclaiming is likely to stop after the huge page is reclaimed,
meaing the further iteration on @sc->priority and the silbing and child
memory cgroups will be skipped. The overall behaviour has been changed.
This fixes the issue by applying the roundup to offlined memory cgroups
only, to give more preference to reclaim memory from offlined memory
cgroup. It sounds reasonable as those memory is unlikedly to be used by
anyone.
The issue was found by starting up 8 VMs on a Ampere Mustang machine,
which has 8 CPUs and 16 GB memory. Each VM is given with 2 vCPUs and
2GB memory. It took 264 seconds for all VMs to be completely up and
784MB swap is consumed after that. With this patch applied, it took 236
seconds and 60MB swap to do same thing. So there is 10% performance
improvement for my case. Note that KSM is disable while THP is enabled
in the testing.
total used free shared buff/cache available
Mem: 16196 10065 2049 16 4081 3749
Swap: 8175 784 7391
total used free shared buff/cache available
Mem: 16196 11324 3656 24 1215 2936
Swap: 8175 60 8115
Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error") Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: <stable@vger.kernel.org> [4.20+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexandru Ardelean [Fri, 21 Feb 2020 04:04:21 +0000 (20:04 -0800)]
lib/string.c: update match_string() doc-strings with correct behavior
There were a few attempts at changing behavior of the match_string()
helpers (i.e. 'match_string()' & 'sysfs_match_string()'), to change &
extend the behavior according to the doc-string.
But the simplest approach is to just fix the doc-strings. The current
behavior is fine as-is, and some bugs were introduced trying to fix it.
As for extending the behavior, new helpers can always be introduced if
needed.
The match_string() helpers behave more like 'strncmp()' in the sense
that they go up to n elements or until the first NULL element in the
array of strings.
This change updates the doc-strings with this info.
Link: http://lkml.kernel.org/r/20200213072722.8249-1-alexandru.ardelean@analog.com Signed-off-by: Alexandru Ardelean <alexandru.ardelean@analog.com> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Tobin C . Harding" <tobin@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Douglas Anderson [Fri, 21 Feb 2020 04:04:12 +0000 (20:04 -0800)]
scripts/get_maintainer.pl: deprioritize old Fixes: addresses
Recently, I found that get_maintainer was causing me to send emails to
the old addresses for maintainers. Since I usually just trust the
output of get_maintainer to know the right email address, I didn't even
look carefully and fired off two patch series that went to the wrong
place. Oops.
The problem was introduced recently when trying to add signatures from
Fixes. The problem was that these email addresses were added too early
in the process of compiling our list of places to send. Things added to
the list earlier are considered more canonical and when we later added
maintainer entries we ended up deduplicating to the old address.
Here are two examples using mainline commits (to make it easier to
replicate) for the two maintainers that I messed up recently:
Joe Perches [Fri, 21 Feb 2020 04:04:09 +0000 (20:04 -0800)]
get_maintainer: remove uses of P: for maintainer name
Commit 1ca84ed6425f ("MAINTAINERS: Reclaim the P: tag for Maintainer
Entry Profile") changed the use of the "P:" tag from "Person" to
"Profile (ie: special subsystem coding styles and characteristics)"
Change how get_maintainer.pl parses the "P:" tag to match.
Link: http://lkml.kernel.org/r/ca53823fc5d25c0be32ad937d0207a0589c08643.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Dan Williams <dan.j.william@intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SeongJae Park [Fri, 21 Feb 2020 04:04:06 +0000 (20:04 -0800)]
selftests/vm: add missed tests in run_vmtests
The commits introducing 'mlock-random-test'[1], 'map_fiex_noreplace'[2],
and 'thuge-gen'[3] have not added those in the 'run_vmtests' script and
thus the 'run_tests' command of kselftests doesn't run those. This
commit adds those in the script.
'gup_benchmark' and 'transhuge-stress' are also not included in the
'run_vmtests', but this commit does not add those because those are for
performance measurement rather than pass/fail tests.
[1] commit 26b4224d9961 ("selftests: expanding more mlock selftest")
[2] commit 91cbacc34512 ("tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE")
[3] commit fcc1f2d5dd34 ("selftests: add a test program for variable huge page sizes in mmap/shmget")
Christian Borntraeger [Fri, 21 Feb 2020 04:04:03 +0000 (20:04 -0800)]
include/uapi/linux/swab.h: fix userspace breakage, use __BITS_PER_LONG for swap
QEMU has a funny new build error message when I use the upstream kernel
headers:
CC block/file-posix.o
In file included from /home/cborntra/REPOS/qemu/include/qemu/timer.h:4,
from /home/cborntra/REPOS/qemu/include/qemu/timed-average.h:29,
from /home/cborntra/REPOS/qemu/include/block/accounting.h:28,
from /home/cborntra/REPOS/qemu/include/block/block_int.h:27,
from /home/cborntra/REPOS/qemu/block/file-posix.c:30:
/usr/include/linux/swab.h: In function `__swab':
/home/cborntra/REPOS/qemu/include/qemu/bitops.h:20:34: error: "sizeof" is not defined, evaluates to 0 [-Werror=undef]
20 | #define BITS_PER_LONG (sizeof (unsigned long) * BITS_PER_BYTE)
| ^~~~~~
/home/cborntra/REPOS/qemu/include/qemu/bitops.h:20:41: error: missing binary operator before token "("
20 | #define BITS_PER_LONG (sizeof (unsigned long) * BITS_PER_BYTE)
| ^
cc1: all warnings being treated as errors
make: *** [/home/cborntra/REPOS/qemu/rules.mak:69: block/file-posix.o] Error 1
rm tests/qemu-iotests/socket_scm_helper.o
This was triggered by commit d5767057c9a ("uapi: rename ext2_swab() to
swab() and share globally in swab.h"). That patch is doing
#include <asm/bitsperlong.h>
but it uses BITS_PER_LONG.
The kernel file asm/bitsperlong.h provide only __BITS_PER_LONG.
Let us use the __ variant in swap.h
Link: http://lkml.kernel.org/r/20200213142147.17604-1-borntraeger@de.ibm.com Fixes: d5767057c9a ("uapi: rename ext2_swab() to swab() and share globally in swab.h") Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Yury Norov <yury.norov@gmail.com> Cc: Allison Randal <allison@lohutok.net> Cc: Joe Perches <joe@perches.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: William Breathitt Gray <vilhelm.gray@gmail.com> Cc: Torsten Hilbrich <torsten.hilbrich@secunet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list lock usage
in exit_sem()") removes a lock that is needed. This leads to a process
looping infinitely in exit_sem() and can also lead to a crash. There is
a reproducer available in [1] and with the commit reverted the issue
does not reproduce anymore.
Using the reproducer found in [1] is fairly easy to reach a point where
one of the child processes is looping infinitely in exit_sem between
for(;;) and if (semid == -1) block, while it's trying to free its last
sem_undo structure which has already been freed by freeary().
Each sem_undo struct is on two lists: one per semaphore set (list_id)
and one per process (list_proc). The list_id list tracks undos by
semaphore set, and the list_proc by process.
Undo structures are removed either by freeary() or by exit_sem(). The
freeary function is invoked when the user invokes a syscall to remove a
semaphore set. During this operation freeary() traverses the list_id
associated with the semaphore set and removes the undo structures from
both the list_id and list_proc lists.
For this case, exit_sem() is called at process exit. Each process
contains a struct sem_undo_list (referred to as "ulp") which contains
the head for the list_proc list. When the process exits, exit_sem()
traverses this list to remove each sem_undo struct. As in freeary(),
whenever a sem_undo struct is removed from list_proc, it is also removed
from the list_id list.
Removing elements from list_id is safe for both exit_sem() and freeary()
due to sem_lock(). Removing elements from list_proc is not safe;
freeary() locks &un->ulp->lock when it performs
list_del_rcu(&un->list_proc) but exit_sem() does not (locking was
removed by commit a97955844807 ("ipc,sem: remove uneeded sem_undo_list
lock usage in exit_sem()").
This can result in the following situation while executing the
reproducer [1] : Consider a child process in exit_sem() and the parent
in freeary() (because of semctl(sid[i], NSEM, IPC_RMID)).
- The list_proc for the child contains the last two undo structs A and
B (the rest have been removed either by exit_sem() or freeary()).
- The semid for A is 1 and semid for B is 2.
- exit_sem() removes A and at the same time freeary() removes B.
- Since A and B have different semid sem_lock() will acquire different
locks for each process and both can proceed.
The bug is that they remove A and B from the same list_proc at the same
time because only freeary() acquires the ulp lock. When exit_sem()
removes A it makes ulp->list_proc.next to point at B and at the same
time freeary() removes B setting B->semid=-1.
At the next iteration of for(;;) loop exit_sem() will try to remove B.
The only way to break from for(;;) is for (&un->list_proc ==
&ulp->list_proc) to be true which is not. Then exit_sem() will check if
B->semid=-1 which is and will continue looping in for(;;) until the
memory for B is reallocated and the value at B->semid is changed.
At that point, exit_sem() will crash attempting to unlink B from the
lists (this can be easily triggered by running the reproducer [1] a
second time).
To prove this scenario instrumentation was added to keep information
about each sem_undo (un) struct that is removed per process and per
semaphore set (sma).
CPU0 CPU1
[caller holds sem_lock(sma for A)] ...
freeary() exit_sem()
... ...
... sem_lock(sma for B)
spin_lock(A->ulp->lock) ...
list_del_rcu(un_A->list_proc) list_del_rcu(un_B->list_proc)
Undo structures A and B have different semid and sem_lock() operations
proceed. However they belong to the same list_proc list and they are
removed at the same time. This results into ulp->list_proc.next
pointing to the address of B which is already removed.
After reverting commit a97955844807 ("ipc,sem: remove uneeded
sem_undo_list lock usage in exit_sem()") the issue was no longer
reproducible.
There are no in-kernel users remaining, but there may still be users that
include linux/time.h instead of sys/time.h from user space, so leave the
types available to user space while hiding them from kernel space.
Only the __kernel_old_* versions of these types remain now.
Rafael J. Wysocki [Fri, 21 Feb 2020 00:46:18 +0000 (01:46 +0100)]
ACPI: PM: s2idle: Check fixed wakeup events in acpi_s2idle_wake()
Commit fdde0ff8590b ("ACPI: PM: s2idle: Prevent spurious SCIs from
waking up the system") overlooked the fact that fixed events can wake
up the system too and broke RTC wakeup from suspend-to-idle as a
result.
Fix this issue by checking the fixed events in acpi_s2idle_wake() in
addition to checking wakeup GPEs and break out of the suspend-to-idle
loop if the status bits of any enabled fixed events are set then.
Fixes: fdde0ff8590b ("ACPI: PM: s2idle: Prevent spurious SCIs from waking up the system") Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: 5.4+ <stable@vger.kernel.org> # 5.4+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Temperature labels are not always present. Adjust sysfs attribute
visibility accordingly.
Reported-by: Meelis Roos <mroos@linux.ee> Suggested-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Dr. David Alan Gilbert <linux@treblig.org> Cc: Meelis Roos <mroos@linux.ee> Cc: Dr. David Alan Gilbert <linux@treblig.org> Fixes: 266cd5835947 ("hwmon: (w83627ehf) convert to with_info interface") Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Jens Axboe [Fri, 21 Feb 2020 16:18:00 +0000 (09:18 -0700)]
Merge branch 'nvme-5.6-rc3' of git://git.infradead.org/nvme into block-5.6
Pull NVMe fixes from Keith.
* 'nvme-5.6-rc3' of git://git.infradead.org/nvme:
nvme-multipath: Fix memory leak with ana_log_buf
nvme: Fix uninitialized-variable warning
nvme-pci: Use single IRQ vector for old Apple models
nvme/pci: Add sleep quirk for Samsung and Toshiba drives
Filipe Manana [Thu, 20 Feb 2020 13:29:49 +0000 (13:29 +0000)]
Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
CPU 1 CPU 2
btrfs_sync_file()
--> for range [0, 1MiB)
--> inode has a size of
1MiB and has 1 prealloc
extent beyond the
i_size, starting at offset
4MiB
flushes all delalloc for the
range [0MiB, 1MiB) and waits
for the respective ordered
extents to complete
--> before task at CPU 1 locks the
inode, a write into file range
[1MiB, 2MiB + 1KiB) is made
--> i_size is updated to 2MiB + 1KiB
--> writeback is started for that
range, [1MiB, 2MiB + 4KiB)
--> end offset rounded up to
be sector size aligned
btrfs_log_changed_extents()
btrfs_log_prealloc_extents()
--> does a search on the
inode's root
--> holds a read lock on
leaf X
btrfs_finish_ordered_io()
--> locks range [1MiB, 2MiB + 4KiB)
--> end offset rounded up
to be sector size aligned
--> tries to cow leaf X, through
insert_reserved_file_extent()
--> already locked by the
task at CPU 1
btrfs_truncate_inode_items()
--> gets an i_size of
2MiB + 1KiB, which is
not sector size
aligned
--> tries to lock file
range [2MiB, (u64)-1)
--> the start range
is rounded down
from 2MiB + 1K
to 2MiB to be sector
size aligned
--> but the subrange
[2MiB, 2MiB + 4KiB) is
already locked by
task at CPU 2 which
is waiting to get a
write lock on leaf X
for which we are
holding a read lock
*** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.
Fixes: 28553fa992cb28 ("Btrfs: fix race between shrinking truncate and fiemap") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
nvme_mpath_init() is called by nvme_init_identify() which is called in
multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This
means nvme_mpath_init() may be called multiple times before
nvme_mpath_uninit() (which is only called on nvme_free_ctrl()).
When nvme_mpath_init() is called multiple times, it overwrites the
ana_log_buf pointer with a new allocation, thus leaking the previous
allocation.
To fix this, free ana_log_buf before allocating a new one.
Fixes: 0d0b660f214dc490 ("nvme: add ANA support") Cc: <stable@vger.kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
Zenghui Yu [Fri, 21 Feb 2020 02:07:25 +0000 (10:07 +0800)]
genirq/irqdomain: Make sure all irq domain flags are distinct
This was noticed when printing debugfs for MSIs on my ARM64 server. The
new dstate IRQD_MSI_NOMASK_QUIRK came out surprisingly while it should only
be the x86 stuff for the time being...
The new MSI quirk flag uses the same bit as IRQ_DOMAIN_NAME_ALLOCATED which
is oddly defined as bit 6 for no good reason.
Randy Dunlap [Thu, 20 Feb 2020 01:28:21 +0000 (17:28 -0800)]
zonefs: fix documentation typos etc.
Fix typos, spellos, etc. in zonefs.txt.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Damien Le Moal <Damien.LeMoal@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Ma Jun [Sun, 2 Feb 2020 09:56:58 +0000 (17:56 +0800)]
csky: Minimize defconfig to support buildroot config.fragment
Some bsp (eg: buildroot) has defconfig.fragment design to add more
configs into the defconfig in linux source code tree. For example,
we could put different cpu configs into different defconfig.fragments,
but they all use the same defconfig in Linux.
Signed-off-by: Ma Jun <majun258@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Guo Ren [Tue, 8 Oct 2019 06:25:13 +0000 (14:25 +0800)]
csky: Add setup_initrd check code
We should give some necessary check for initrd just like other
architectures and it seems that setup_initrd() could be a common
code for all architectures.