Thomas Gleixner [Sat, 23 Feb 2019 09:53:31 +0000 (10:53 +0100)]
Merge tag 'irqchip-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip updates from Marc Zyngier
- Core pseudo-NMI handling code
- Allow the default irq domain to be retrieved
- A new interrupt controller for the Loongson LS1X platform
- Affinity support for the SiFive PLIC
- Better support for the iMX irqsteer driver
- NUMA aware memory allocations for GICv3
- A handful of other fixes (i8259, GICv3, PLIC)
Aisheng Dong [Wed, 20 Feb 2019 11:40:47 +0000 (11:40 +0000)]
irqchip/imx-irqsteer: Change to use reg_num instead of irq_group
One group can manage 64 interrupts by using two registers (e.g. STATUS/SET).
However, the integrated irqsteer may support only 32 interrupts which
needs only one register in a group. But the current driver assume there's
a mininum of two registers in a group which result in a wrong register map
for 32 interrupts per channel irqsteer. Let's use the reg_num caculated by
interrupts per channel instead of irq_group to cover this case.
Cc: Rob Herring <robh+dt@kernel.org> Cc: Shawn Guo <shawnguo@kernel.org> Reviewed-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Aisheng Dong [Wed, 20 Feb 2019 11:40:40 +0000 (11:40 +0000)]
dt-binding: irq: imx-irqsteer: Use irq number instead of group number
Not all 64 interrupts may be used in one group. e.g. most irqsteer in
imx8qxp and imx8qm subsystems supports only 32 interrupts.
As the IP integration parameters are Channel number and interrupts number,
let's use fsl,irqs-num to represents how many interrupts supported
by this irqsteer channel.
Note this will break the compatibility of old binding. As the original
fsl,irq-groups was born out of a misunderstanding of the HW config
options and we are not aware of any users of the current binding.
And the old binding was just published in recent months, so it's
worth to change now to avoid confusing in the future.
Cc: Rob Herring <robh+dt@kernel.org> Cc: Shawn Guo <shawnguo@kernel.org> Cc: devicetree@vger.kernel.org Reviewed-by: Lucas Stach <l.stach@pengutronix.de> Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Doug Berger [Wed, 20 Feb 2019 22:15:28 +0000 (14:15 -0800)]
irqchip/brcmstb-l2: Use _irqsave locking variants in non-interrupt code
Using the irq_gc_lock/irq_gc_unlock functions in the suspend and
resume functions creates the opportunity for a deadlock during
suspend, resume, and shutdown. Using the irq_gc_lock_irqsave/
irq_gc_unlock_irqrestore variants prevents this possible deadlock.
Cc: stable@vger.kernel.org Fixes: 7f646e92766e2 ("irqchip: brcmstb-l2: Add Broadcom Set Top Box Level-2 interrupt controller") Signed-off-by: Doug Berger <opendmb@gmail.com> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
[maz: tidied up $SUBJECT] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Shanker Donthineni [Mon, 14 Jan 2019 09:50:19 +0000 (09:50 +0000)]
irqchip/gicv3-its: Use NUMA aware memory allocation for ITS tables
The NUMA node information is visible to ITS driver but not being used
other than handling hardware errata. ITS/GICR hardware accesses to the
local NUMA node is usually quicker than the remote NUMA node. How slow
the remote NUMA accesses are depends on the implementation details.
This patch allocates memory for ITS management tables and command
queue from the corresponding NUMA node using the appropriate NUMA
aware functions. This change improves the performance of the ITS
tables read latency on systems where it has more than one ITS block,
and with the slower inter node accesses.
Apache Web server benchmarking using ab tool on a HiSilicon D06
board with multiple numa mem nodes shows Time per request and
Transfer rate improvements of ~3.6% with this patch.
Marc Zyngier [Wed, 20 Feb 2019 08:59:23 +0000 (08:59 +0000)]
irqdomain: Allow the default irq domain to be retrieved
The default irq domain allows legacy code to create irqdomain
mappings without having to track the domain it is allocating
from. Setting the default domain is a one shot, fire and forget
operation, and no effort was made to be able to retrieve this
information at a later point in time.
Newer irqdomain APIs (the hierarchical stuff) relies on both
the irqchip code to track the irqdomain it is allocating from,
as well as some form of firmware abstraction to easily identify
which piece of HW maps to which irq domain (DT, ACPI).
For systems without such firmware (or legacy platform that are
getting dragged into the 21st century), things are a bit harder.
For these cases (and these cases only!), let's provide a way
to retrieve the default domain, allowing the use of the v2 API
without having to resort to platform-specific hacks.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Anup Patel [Tue, 12 Feb 2019 12:52:46 +0000 (18:22 +0530)]
irqchip/sifive-plic: Implement irq_set_affinity() for SMP host
Currently on SMP host, all CPUs take external interrupts routed via
PLIC. All CPUs will try to claim a given external interrupt but only
one of them will succeed while other CPUs would simply resume whatever
they were doing before. This means if we have N CPUs then for every
external interrupt N-1 CPUs will always fail to claim it and waste
their CPU time.
Instead of above, external interrupts should be taken by only one CPU
and we should have provision to explicitly specify IRQ affinity from
kernel-space or user-space.
This patch provides irq_set_affinity() implementation for PLIC driver.
It also updates irq_enable() such that PLIC interrupts are only enabled
for one of CPUs specified in IRQ affinity mask.
With this patch in-place, we can change IRQ affinity at any-time from
user-space using procfs.
Anup Patel [Tue, 12 Feb 2019 12:52:45 +0000 (18:22 +0530)]
irqchip/sifive-plic: Differentiate between PLIC handler and context
We explicitly differentiate between PLIC handler and context because
PLIC context is for given mode of HART whereas PLIC handler is per-CPU
software construct meant for handling interrupts from a particular
PLIC context.
To achieve this differentiation, we rename "nr_handlers" to "nr_contexts"
and "nr_mapped" to "nr_handlers" in plic_init().
Signed-off-by: Anup Patel <anup@brainfault.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Anup Patel [Tue, 12 Feb 2019 12:52:44 +0000 (18:22 +0530)]
irqchip/sifive-plic: Add warning in plic_init() if handler already present
We have two enteries (one for M-mode and another for S-mode) in the
interrupts-extended DT property of PLIC DT node for each HART. It is
expected that firmware/bootloader will set M-mode HWIRQ line of each
HART to 0xffffffff (i.e. -1) in interrupts-extended DT property
because Linux runs in S-mode only.
If firmware/bootloader is buggy then it will not correctly update
interrupts-extended DT property which might result in a plic_handler
configured twice. This patch adds a warning in plic_init() if a
plic_handler is already marked present. This warning provides us
a hint about incorrectly updated interrupts-extended DT property.
Signed-off-by: Anup Patel <anup@brainfault.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Anup Patel [Tue, 12 Feb 2019 12:52:43 +0000 (18:22 +0530)]
irqchip/sifive-plic: Pre-compute context hart base and enable base
This patch does following optimizations:
1. Pre-compute hart base for each context handler
2. Pre-compute enable base for each context handler
3. Have enable lock for each context handler instead
of global plic_toggle_lock
Signed-off-by: Anup Patel <anup@brainfault.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Thomas Gleixner [Sat, 16 Feb 2019 17:13:12 +0000 (18:13 +0100)]
PCI/MSI: Remove obsolete sanity checks for multiple interrupt sets
Multiple interrupt sets for affinity spreading are now handled in the core
code and the number of sets and their size is recalculated via a driver
supplied callback.
That avoids the requirement to invoke pci_alloc_irq_vectors_affinity() with
the arguments minvecs and maxvecs set to the same value and the callsite
handling the ENOSPC situation.
Remove the now obsolete sanity checks and the related comments.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.778630549@linutronix.de
Ming Lei [Sat, 16 Feb 2019 17:13:10 +0000 (18:13 +0100)]
nvme-pci: Simplify interrupt allocation
The NVME PCI driver contains a tedious mechanism for interrupt
allocation, which is necessary to adjust the number and size of interrupt
sets to the maximum available number of interrupts which depends on the
underlying PCI capabilities and the available CPU resources.
It works around the former short comings of the PCI and core interrupt
allocation mechanims in combination with interrupt sets.
The PCI interrupt allocation function allows to provide a maximum and a
minimum number of interrupts to be allocated and tries to allocate as
many as possible. This worked without driver interaction as long as there
was only a single set of interrupts to handle.
With the addition of support for multiple interrupt sets in the generic
affinity spreading logic, which is invoked from the PCI interrupt
allocation, the adaptive loop in the PCI interrupt allocation did not
work for multiple interrupt sets. The reason is that depending on the
total number of interrupts which the PCI allocation adaptive loop tries
to allocate in each step, the number and the size of the interrupt sets
need to be adapted as well. Due to the way the interrupt sets support was
implemented there was no way for the PCI interrupt allocation code or the
core affinity spreading mechanism to invoke a driver specific function
for adapting the interrupt sets configuration.
As a consequence the driver had to implement another adaptive loop around
the PCI interrupt allocation function and calling that with maximum and
minimum interrupts set to the same value. This ensured that the
allocation either succeeded or immediately failed without any attempt to
adjust the number of interrupts in the PCI code.
The core code now allows drivers to provide a callback to recalculate the
number and the size of interrupt sets during PCI interrupt allocation,
which in turn allows the PCI interrupt allocation function to be called
in the same way as with a single set of interrupts. The PCI code handles
the adaptive loop and the interrupt affinity spreading mechanism invokes
the driver callback to adapt the interrupt set configuration to the
current loop value. This replaces the adaptive loop in the driver
completely.
Implement the NVME specific callback which adjusts the interrupt sets
configuration and remove the adaptive allocation loop.
[ tglx: Simplify the callback further and restore the dropped adjustment of
number of sets ]
Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.602546658@linutronix.de
Ming Lei [Sat, 16 Feb 2019 17:13:09 +0000 (18:13 +0100)]
genirq/affinity: Add new callback for (re)calculating interrupt sets
The interrupt affinity spreading mechanism supports to spread out
affinities for one or more interrupt sets. A interrupt set contains one or
more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queus of multiqueue block
devices.
The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilites and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.
The driver passes initial configuration for the interrupt allocation via a
pointer to struct irq_affinity.
Right now the allocation mechanism is complex as it requires to have a loop
in the driver to determine the maximum number of interrupts which are
provided by the PCI capabilities and the underlying CPU resources. This
loop would have to be replicated in every driver which wants to utilize
this mechanism. That's unwanted code duplication and error prone.
In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and their
size, in the core code. As the core code does not have any knowledge about the
underlying device, a driver specific callback is required in struct
irq_affinity, which can be invoked by the core code. The callback gets the
number of available interupts as an argument, so the driver can calculate the
corresponding number and size of interrupt sets.
At the moment the struct irq_affinity pointer which is handed in from the
driver and passed through to several core functions is marked 'const', but for
the callback to be able to modify the data in the struct it's required to
remove the 'const' qualifier.
Add the optional callback to struct irq_affinity, which allows drivers to
recalculate the number and size of interrupt sets and remove the 'const'
qualifier.
For simple invocations, which do not supply a callback, a default callback
is installed, which just sets nr_sets to 1 and transfers the number of
spreadable vectors to the set_size array at index 0.
This is for now guarded by a check for nr_sets != 0 to keep the NVME driver
working until it is converted to the callback mechanism.
To make sure that the driver configuration is correct under all circumstances
the callback is invoked even when there are no interrupts for queues left,
i.e. the pre/post requirements already exhaust the numner of available
interrupts.
At the PCI layer irq_create_affinity_masks() has to be invoked even for the
case where the legacy interrupt is used. That ensures that the callback is
invoked and the device driver can adjust to that situation.
[ tglx: Fixed the simple case (no sets required). Moved the sanity check
for nr_sets after the invocation of the callback so it catches
broken drivers. Fixed the kernel doc comments for struct
irq_affinity and de-'This patch'-ed the changelog ]
Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.512444498@linutronix.de
Ming Lei [Sat, 16 Feb 2019 17:13:08 +0000 (18:13 +0100)]
genirq/affinity: Store interrupt sets size in struct irq_affinity
The interrupt affinity spreading mechanism supports to spread out
affinities for one or more interrupt sets. A interrupt set contains one
or more interrupts. Each set is mapped to a specific functionality of a
device, e.g. general I/O queues and read I/O queus of multiqueue block
devices.
The number of interrupts per set is defined by the driver. It depends on
the total number of available interrupts for the device, which is
determined by the PCI capabilites and the availability of underlying CPU
resources, and the number of queues which the device provides and the
driver wants to instantiate.
The driver passes initial configuration for the interrupt allocation via
a pointer to struct irq_affinity.
Right now the allocation mechanism is complex as it requires to have a
loop in the driver to determine the maximum number of interrupts which
are provided by the PCI capabilities and the underlying CPU resources.
This loop would have to be replicated in every driver which wants to
utilize this mechanism. That's unwanted code duplication and error
prone.
In order to move this into generic facilities it is required to have a
mechanism, which allows the recalculation of the interrupt sets and
their size, in the core code. As the core code does not have any
knowledge about the underlying device, a driver specific callback will
be added to struct affinity_desc, which will be invoked by the core
code. The callback will get the number of available interupts as an
argument, so the driver can calculate the corresponding number and size
of interrupt sets.
To support this, two modifications for the handling of struct irq_affinity
are required:
1) The (optional) interrupt sets size information is contained in a
separate array of integers and struct irq_affinity contains a
pointer to it.
This is cumbersome and as the maximum number of interrupt sets is small,
there is no reason to have separate storage. Moving the size array into
struct affinity_desc avoids indirections and makes the code simpler.
2) At the moment the struct irq_affinity pointer which is handed in from
the driver and passed through to several core functions is marked
'const'.
With the upcoming callback to recalculate the number and size of
interrupt sets, it's necessary to remove the 'const'
qualifier. Otherwise the callback would not be able to update the data.
Implement #1 and store the interrupt sets size in 'struct irq_affinity'.
No functional change.
[ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
source. Fixed the kernel doc comments for struct irq_affinity and
de-'This patch'-ed the changelog ]
Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
Aaro Koskinen [Wed, 6 Feb 2019 21:26:08 +0000 (23:26 +0200)]
irqchip/i8259: Fix shutdown order by moving syscore_ops registration
When using cpufreq on Loongson 2F MIPS platform, "poweroff"
command gets frequently stuck in syscore_shutdown(). The reason is
that i8259A_shutdown() gets called before cpufreq_suspend(), and if we
have pending work then irq_work_sync() in cpufreq_dbs_governor_stop()
gets stuck forever as we have all interrupts masked already.
irq-i8259 is registering syscore_ops using device_initcall(),
while cpufreq uses core_initcall(). Fix the shutdown order simply
by registering the irq syscore_ops during the early IRQ init instead
of using a separate initcall at later stage.
Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Zenghui Yu [Sun, 10 Feb 2019 05:24:10 +0000 (05:24 +0000)]
irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table
In current logic, its_parse_indirect_baser() will be invoked twice
when allocating Device tables. Add a *break* to omit the unnecessary
and annoying (might be ...) invoking.
Fixes: 32bd44dc19de ("irqchip/gic-v3-its: Fix the incorrect parsing of VCPU table size") Cc: stable@vger.kernel.org Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Matthias Kaehlcke [Mon, 28 Jan 2019 23:46:25 +0000 (15:46 -0800)]
softirq: Don't skip softirq execution when softirq thread is parking
When a CPU is unplugged the kernel threads of this CPU are parked (see
smpboot_park_threads()). kthread_park() is used to mark each thread as
parked and wake it up, so it can complete the process of parking itselfs
(see smpboot_thread_fn()).
If local softirqs are pending on interrupt exit invoke_softirq() is called
to process the softirqs, however it skips processing when the softirq
kernel thread of the local CPU is scheduled to run. The softirq kthread is
one of the threads that is parked when a CPU is unplugged. Parking the
kthread wakes it up, however only to complete the parking process, not to
process the pending softirqs. Hence processing of softirqs at the end of an
interrupt is skipped, but not done elsewhere, which can result in warnings
about pending softirqs when a CPU is unplugged:
Don't skip processing of softirqs at the end of an interrupt when the
softirq thread of the CPU is parking.
Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Douglas Anderson <dianders@chromium.org> Cc: Stephen Boyd <swboyd@chromium.org> Link: https://lkml.kernel.org/r/20190128234625.78241-3-mka@chromium.org
Matthias Kaehlcke [Mon, 28 Jan 2019 23:46:24 +0000 (15:46 -0800)]
kthread: Add __kthread_should_park()
kthread_should_park() is used to check if the calling kthread ('current')
should park, but there is no function to check whether an arbitrary kthread
should be parked. The latter is required to plug a CPU hotplug race vs. a
parking ksoftirqd thread.
The new __kthread_should_park() receives a task_struct as parameter to
check if the corresponding kernel thread should be parked.
Call __kthread_should_park() from kthread_should_park() to avoid code
duplication.
Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Douglas Anderson <dianders@chromium.org> Cc: Stephen Boyd <swboyd@chromium.org> Link: https://lkml.kernel.org/r/20190128234625.78241-2-mka@chromium.org
Thomas Gleixner [Fri, 8 Feb 2019 13:48:04 +0000 (14:48 +0100)]
proc/stat: Make the interrupt statistics more efficient
Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.
The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.
The interrupt core provides now a per interrupt summary counter which can
be used to avoid the summation loops completely except for interrupts
marked PER_CPU which are only a small fraction of the interrupt space if at
all.
Another simplification is to iterate only over the active interrupts and
skip the potentially large gaps in the interrupt number space and just
print zeros for the gaps without going into the interrupt core in the first
place.
Waiman provided test results from a 4-socket IvyBridge-EX system (60-core
120-thread, 3016 irqs) excuting a test program which reads /proc/stat
50,000 times:
Thomas Gleixner [Fri, 8 Feb 2019 13:48:03 +0000 (14:48 +0100)]
genirq: Avoid summation loops for /proc/stat
Waiman reported that on large systems with a large amount of interrupts the
readout of /proc/stat takes a long time to sum up the interrupt
statistics. In principle this is not a problem. but for unknown reasons
some enterprise quality software reads /proc/stat with a high frequency.
The reason for this is that interrupt statistics are accounted per cpu. So
the /proc/stat logic has to sum up the interrupt stats for each interrupt.
This can be largely avoided for interrupts which are not marked as
'PER_CPU' interrupts by simply adding a per interrupt summation counter
which is incremented along with the per interrupt per cpu counter.
The PER_CPU interrupts need to avoid that and use only per cpu accounting
because they share the interrupt number and the interrupt descriptor and
concurrent updates would conflict or require unwanted synchronization.
Reported-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: linux-fsdevel@vger.kernel.org Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Randy Dunlap <rdunlap@infradead.org> Link: https://lkml.kernel.org/r/20190208135020.925487496@linutronix.de
8<-------------
v2: Undo the unintentional layout change of struct irq_desc.
Linus Torvalds [Thu, 7 Feb 2019 22:53:26 +0000 (15:53 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Three security fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
Linus Torvalds [Thu, 7 Feb 2019 22:44:45 +0000 (15:44 -0700)]
Merge tag 'nfsd-5.0-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
"Two small nfsd bugfixes for 5.0, for an RDMA bug and a file clone bug"
* tag 'nfsd-5.0-1' of git://linux-nfs.org/~bfields/linux:
svcrdma: Remove max_sge check at connect time
nfsd: Fix error return values for nfsd4_clone_file_range()
Linus Torvalds [Thu, 7 Feb 2019 22:42:43 +0000 (15:42 -0700)]
Merge tag 'for-5.0/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Both of these fixes address issues in changes merged for 5.0-rc4:
- Fix DM core's missing memory barrier before waitqueue_active()
calls.
- Fix DM core's clone_bio() to work when cloning a subset of a bio
with an integrity payload; bio_integrity_trim() wasn't getting
called due to bio_trim()'s early return"
* tag 'for-5.0/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: don't use bio_trim() afterall
dm: add memory barrier before waitqueue_active
There are multiple code paths where an hrtimer may have been started to
emulate an L1 VMX preemption timer that can result in a call to free_nested
without an intervening L2 exit where the hrtimer is normally
cancelled. Unconditionally cancel in free_nested to cover all cases.
Embargoed until Feb 7th 2019.
Signed-off-by: Peter Shier <pshier@google.com> Reported-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reported-by: Felix Wilhelm <fwilhelm@google.com> Cc: stable@kernel.org
Message-Id: <20181011184646.154065-1-pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Emulation of certain instructions (VMXON, VMCLEAR, VMPTRLD, VMWRITE with
memory operand, INVEPT, INVVPID) can incorrectly inject a page fault
when passed an operand that points to an MMIO address. The page fault
will use uninitialized kernel stack memory as the CR2 and error code.
The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR
exit to userspace; however, it is not an easy fix, so for now just
ensure that the error code and CR2 are zero.
Embargoed until Feb 7th 2019.
Reported-by: Felix Wilhelm <fwilhelm@google.com> Cc: stable@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
1. creates a device that holds a reference to the VM object (with a borrowed
reference, the VM's refcount has not been bumped yet)
2. initializes the device
3. transfers the reference to the device to the caller's file descriptor table
4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
reference
The ownership transfer in step 3 must not happen before the reference to the VM
becomes a proper, non-borrowed reference, which only happens in step 4.
After step 3, an attacker can close the file descriptor and drop the borrowed
reference, which can cause the refcount of the kvm object to drop to zero.
This means that we need to grab a reference for the device before
anon_inode_getfd(), otherwise the VM can disappear from under us.
Fixes: 852b6d57dc7f ("kvm: add device control API") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Thu, 7 Feb 2019 08:33:56 +0000 (08:33 +0000)]
Merge tag 'sound-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of a few small fixes.
The most significant one is the fix for the possible race at loading
HD-audio drivers. This has been present for long time and surfaced
only in a rare occasion, but finally spotted out"
* tag 'sound-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/ca0132 - Fix build error without CONFIG_PCI
ALSA: compress: Fix stop handling on compressed capture streams
ALSA: usb-audio: Add support for new T+A USB DAC
ALSA: hda - Serialize codec registrations
ALSA: hda/realtek - Use a common helper for hp pin reference
ALSA: hda/realtek - Fix lose hp_pins for disable auto mute
ALSA: hda/realtek - Headset microphone support for System76 darp5
Linus Torvalds [Thu, 7 Feb 2019 08:05:28 +0000 (08:05 +0000)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"A small fix for a uapi header, and a fix for VDPA for non-x86 guests"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio: drop internal struct from UAPI
virtio: support VIRTIO_F_ORDER_PLATFORM
Linus Torvalds [Thu, 7 Feb 2019 07:59:01 +0000 (07:59 +0000)]
Merge tag 'trace-v5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"This has two fixes for uprobe code.
- Cut and paste fix to have uprobe printks say "uprobe" and not
"kprobe"
- Add terminating '\0' byte when copying function arguments"
* tag 'trace-v5.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/uprobes: Fix output for multiple string arguments
tracing: uprobes: Fix typo in pr_fmt string
Linus Torvalds [Thu, 7 Feb 2019 07:52:08 +0000 (07:52 +0000)]
Merge tag 'fuse-fixes-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"A fix for a CUSE regression introduced in v4.20, as well as fixes for
a couple of old bugs"
* tag 'fuse-fixes-5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: decrement NR_WRITEBACK_TEMP on the right page
fuse: call pipe_buf_release() under pipe lock
cuse: fix ioctl
fuse: handle zero sized retrieve correctly
Linus Torvalds [Thu, 7 Feb 2019 07:47:08 +0000 (07:47 +0000)]
Merge tag 'pinctrl-v5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Mediatek Kconfig fix
- Sunxi regulator, IRQ banks and pin base fixup
- Intel Cherryview Strago DMI workaround
- Potential regmap problem on mcp23s08
* tag 'pinctrl-v5.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: sunxi: Correct number of IRQ banks on H6 main pin controller
pinctrl: mcp23s08: spi: Fix regmap allocation for mcp23s18
pinctrl: cherryview: fix Strago DMI workaround
pinctrl: sunxi: Consider pin_base when calculating regulator array index
pinctrl: sunxi: Fix and simplify pin bank regulator handling
pinctrl: mediatek: fix Kconfig build errors for moore core
Mike Snitzer [Tue, 5 Feb 2019 22:07:58 +0000 (17:07 -0500)]
dm: don't use bio_trim() afterall
bio_trim() has an early return, which makes it _not_ idempotent, if the
offset is 0 and the bio's bi_size already matches the requested size.
Prior to DM, all users of bio_trim() were fine with this. But DM has
exposed the fact that bio_trim()'s early return is incompatible with a
cloned bio whose integrity payload must be trimmed via
bio_integrity_trim().
Fix this by reverting DM back to doing the equivalent of bio_trim() but
in an idempotent manner (so bio_integrity_trim is always performed).
Follow-on work is needed to assess what benefit bio_trim()'s early
return is providing to its existing callers.
Reported-by: Milan Broz <gmazyland@gmail.com> Fixes: 57c36519e4b94 ("dm: fix clone_bio() to trigger blk_recount_segments()") Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mikulas Patocka [Tue, 5 Feb 2019 10:09:00 +0000 (05:09 -0500)]
dm: add memory barrier before waitqueue_active
Block core changes to switch bio-based IO accounting to be percpu had a
side-effect of altering DM core to now rely on calling waitqueue_active
(in both bio-based and request-based) to check if another task is in
dm_wait_for_completion().
A memory barrier is needed before calling waitqueue_active(). DM core
doesn't piggyback on a preceding memory barrier so it must explicitly
use its own.
For more details on why using waitqueue_active() without a preceding
barrier is unsafe, please see the comment before the waitqueue_active()
definition in include/linux/wait.h.
Add the missing memory barrier by switching to using wq_has_sleeper().
Fixes: 6f75723190d8 ("dm: remove the pending IO accounting") Fixes: c4576aed8d85 ("dm: fix request-based dm's use of dm_wait_for_completion") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Chuck Lever [Fri, 25 Jan 2019 21:54:54 +0000 (16:54 -0500)]
svcrdma: Remove max_sge check at connect time
Two and a half years ago, the client was changed to use gathered
Send for larger inline messages, in commit 655fec6987b ("xprtrdma:
Use gathered Send for large inline messages"). Several fixes were
required because there are a few in-kernel device drivers whose
max_sge is 3, and these were broken by the change.
Apparently my memory is going, because some time later, I submitted
commit 25fd86eca11c ("svcrdma: Don't overrun the SGE array in
svc_rdma_send_ctxt"), and after that, commit f3c1fd0ee294 ("svcrdma:
Reduce max_send_sges"). These too incorrectly assumed in-kernel
device drivers would have more than a few Send SGEs available.
The fix for the server side is not the same. This is because the
fundamental problem on the server is that, whether or not the client
has provisioned a chunk for the RPC reply, the server must squeeze
even the most complex RPC replies into a single RDMA Send. Failing
in the send path because of Send SGE exhaustion should never be an
option.
Therefore, instead of failing when the send path runs out of SGEs,
switch to using a bounce buffer mechanism to handle RPC replies that
are too complex for the device to send directly. That allows us to
remove the max_sge check to enable drivers with small max_sge to
work again.
Reported-by: Don Dutile <ddutile@redhat.com> Fixes: 25fd86eca11c ("svcrdma: Don't overrun the SGE array in ...") Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Trond Myklebust [Mon, 21 Jan 2019 20:58:38 +0000 (15:58 -0500)]
nfsd: Fix error return values for nfsd4_clone_file_range()
If the parameter 'count' is non-zero, nfsd4_clone_file_range() will
currently clobber all errors returned by vfs_clone_file_range() and
replace them with EINVAL.
Fixes: 42ec3d4c0218 ("vfs: make remap_file_range functions take and...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.20+ Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Takashi Iwai [Tue, 5 Feb 2019 16:57:27 +0000 (17:57 +0100)]
ALSA: hda/ca0132 - Fix build error without CONFIG_PCI
A call of pci_iounmap() call without CONFIG_PCI leads to a build error
on some architectures. We tried to address this and add a check of
IS_ENABLED(CONFIG_PCI), but this still doesn't seem enough for sh.
Ideally we should fix it globally, it's really a corner case, so let's
paper over it with a simpler ifdef.
Charles Keepax [Tue, 5 Feb 2019 16:29:40 +0000 (16:29 +0000)]
ALSA: compress: Fix stop handling on compressed capture streams
It is normal user behaviour to start, stop, then start a stream
again without closing it. Currently this works for compressed
playback streams but not capture ones.
The states on a compressed capture stream go directly from OPEN to
PREPARED, unlike a playback stream which moves to SETUP and waits
for a write of data before moving to PREPARED. Currently however,
when a stop is sent the state is set to SETUP for both types of
streams. This leaves a capture stream in the situation where a new
start can't be sent as that requires the state to be PREPARED and
a new set_params can't be sent as that requires the state to be
OPEN. The only option being to close the stream, and then reopen.
Correct this issues by allowing snd_compr_drain_notify to set the
state depending on the stream direction, as we already do in
set_params.
Fixes: 49bb6402f1aa ("ALSA: compress_core: Add support for capture streams") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Udo Eberhardt [Tue, 5 Feb 2019 16:20:47 +0000 (17:20 +0100)]
ALSA: usb-audio: Add support for new T+A USB DAC
This patch adds the T+A VID to the generic check in order to enable
native DSD support for T+A devices. This works with the new T+A USB
DAC model SD3100HV and will also work with future devices which
support the XMOS/Thesycon style DSD format.
Julien Thierry [Thu, 31 Jan 2019 14:54:01 +0000 (14:54 +0000)]
irqdesc: Add domain handler for NMIs
NMI handling code should be executed between calls to nmi_enter and
nmi_exit.
Add a separate domain handler to properly setup NMI context when handling
an interrupt requested as NMI.
Signed-off-by: Julien Thierry <julien.thierry@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Julien Thierry [Thu, 31 Jan 2019 14:54:00 +0000 (14:54 +0000)]
genirq: Provide NMI handlers
Provide flow handlers that are NMI safe for interrupts and percpu_devid
interrupts.
Signed-off-by: Julien Thierry <julien.thierry@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Julien Thierry [Thu, 31 Jan 2019 14:53:58 +0000 (14:53 +0000)]
genirq: Provide basic NMI management for interrupt lines
Add functionality to allocate interrupt lines that will deliver IRQs
as Non-Maskable Interrupts. These allocations are only successful if
the irqchip provides the necessary support and allows NMI delivery for the
interrupt line.
Interrupt lines allocated for NMI delivery must be enabled/disabled through
enable_nmi/disable_nmi_nosync to keep their state consistent.
To treat a PERCPU IRQ as NMI, the interrupt must not be shared nor threaded,
the irqchip directly managing the IRQ must be the root irqchip and the
irqchip cannot be behind a slow bus.
Signed-off-by: Julien Thierry <julien.thierry@arm.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Linus Torvalds [Sun, 3 Feb 2019 17:08:12 +0000 (09:08 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A few updates for x86:
- Fix an unintended sign extension issue in the fault handling code
- Rename the new resource control config switch so it's less
confusing
- Avoid setting up EFI info in kexec when the EFI runtime is
disabled.
- Fix the microcode version check in the AMD microcode loader so it
only loads higher version numbers and never downgrades
- Set EFER.LME in the 32bit trampoline before returning to long mode
to handle older AMD/KVM behaviour properly.
- Add Darren and Andy as x86/platform reviewers"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/resctrl: Avoid confusion over the new X86_RESCTRL config
x86/kexec: Don't setup EFI info if EFI runtime is not enabled
x86/microcode/amd: Don't falsely trick the late loading mechanism
MAINTAINERS: Add Andy and Darren as arch/x86/platform/ reviewers
x86/fault: Fix sign-extend unintended sign extension
x86/boot/compressed/64: Set EFER.LME=1 in 32-bit trampoline before returning to long mode
x86/cpu: Add Atom Tremont (Jacobsville)
Linus Torvalds [Sun, 3 Feb 2019 17:02:03 +0000 (09:02 -0800)]
Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu hotplug fixes from Thomas Gleixner:
"Two fixes for the cpu hotplug machinery:
- Replace the overly clever 'SMT disabled by BIOS' detection logic as
it breaks KVM scenarios and prevents speculation control updates
when the Hyperthreads are brought online late after boot.
- Remove a redundant invocation of the speculation control update
function"
* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
x86/speculation: Remove redundant arch_smt_update() invocation
Linus Torvalds [Sun, 3 Feb 2019 16:59:51 +0000 (08:59 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A pile of perf updates:
- Fix broken sanity check in the /proc/sys/kernel/perf_cpu_time_max_percent
write handler
- Cure a perf script crash which caused by an unitinialized data
structure
- Highlight the hottest instruction in perf top and not a random one
- Cure yet another clang issue when building perf python
- Handle topology entries with no CPU correctly in the tools
- Handle perf data which contains both tracepoints and performance
counter entries correctly.
- Add a missing NULL pointer check in perf ordered_events_free()"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf script: Fix crash when processing recorded stat data
perf top: Fix wrong hottest instruction highlighted
perf tools: Handle TOPOLOGY headers with no CPU
perf python: Remove -fstack-clash-protection when building with some clang versions
perf core: Fix perf_proc_update_handler() bug
perf script: Fix crash with printing mixed trace point and other events
perf ordered_events: Fix crash in ordered_events__free
Linus Torvalds [Sun, 3 Feb 2019 16:57:05 +0000 (08:57 -0800)]
Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull EFI fix from Thomas Gleixner:
"The dump info for the efi page table debugging lacks a terminator
which causes the kernel to crash when the debugfile is read"
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi/arm64: Fix debugfs crash by adding a terminator for ptdump marker
Linus Torvalds [Sun, 3 Feb 2019 16:48:33 +0000 (08:48 -0800)]
Merge tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- regression fix: transaction commit can run away due to delayed ref
waiting heuristic, this is not necessary now because of the proper
reservation mechanism introduced in 5.0
- regression fix: potential crash due to use-before-check of an ERR_PTR
return value
- fix for transaction abort during transaction commit that needs to
properly clean up pending block groups
- fix deadlock during b-tree node/leaf splitting, when this happens on
some of the fundamental trees, we must prevent new tree block
allocation to re-enter indirectly via the block group flushing path
- potential memory leak after errors during mount
* tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: On error always free subvol_name in btrfs_mount
btrfs: clean up pending block groups when transaction commit aborts
btrfs: fix potential oops in device_list_add
btrfs: don't end the transaction for delayed refs in throttle
Btrfs: fix deadlock when allocating tree block during leaf/node split
Linus Torvalds [Sat, 2 Feb 2019 18:34:32 +0000 (10:34 -0800)]
Merge tag 'devicetree-fixes-for-5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
Pull Devicetree fix from Rob Herring:
"A single fix for building DT bindings in-tree"
* tag 'devicetree-fixes-for-5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: Fix dt_binding_check target for in tree builds
Linus Torvalds [Sat, 2 Feb 2019 18:26:14 +0000 (10:26 -0800)]
Merge tag 'riscv-for-linus-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
Pull RISC-V fixes from Palmer Dabbelt:
"This contains a handful of mostly-independent patches:
- make our port respect TIF_NEED_RESCHED, which fixes
CONFIG_PREEMPT=y kernels
- fix double-put of OF nodes
- fix a misspelling of target in our Kconfig
- generic PCIe is enabled in our defconfig
- fix our SBI early console to properly handle line
endings
- fix max_low_pfn being counted in PFNs
- a change to TASK_UNMAPPED_BASE to match what other
arches do
This has passed my standard 'boot Fedora' flow"
* tag 'riscv-for-linus-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
riscv: Adjust mmap base address at a third of task size
riscv: fixup max_low_pfn with PFN_DOWN.
tty/serial: use uart_console_write in the RISC-V SBL early console
RISC-V: defconfig: Add CRYPTO_DEV_VIRTIO=y
RISC-V: defconfig: Enable Generic PCIE by default
RISC-V: defconfig: Move CONFIG_PCI{,E_XILINX}
RISC-V: Kconfig: fix spelling mistake "traget" -> "target"
RISC-V: asm/page.h: fix spelling mistake "CONFIG_64BITS" -> "CONFIG_64BIT"
RISC-V: fix bad use of of_node_put
RISC-V: Add _TIF_NEED_RESCHED check for kernel thread when CONFIG_PREEMPT=y
Linus Torvalds [Sat, 2 Feb 2019 18:16:28 +0000 (10:16 -0800)]
Merge tag 'for-linus-20190202' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A few fixes that should go into this release. This contains:
- MD pull request from Song, fixing a recovery OOM issue (Alexei)
- Fix for a sync related stall (Jianchao)
- Dummy callback for timeouts (Tetsuo)
- IDE atapi sense ordering fix (me)"
* tag 'for-linus-20190202' of git://git.kernel.dk/linux-block:
ide: ensure atapi sense request aren't preempted
blk-mq: fix a hung issue when fsync
block: pass no-op callback to INIT_WORK().
md/raid5: fix 'out of memory' during raid cache recovery
Linus Torvalds [Sat, 2 Feb 2019 17:32:58 +0000 (09:32 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"24 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits)
autofs: fix error return in autofs_fill_super()
autofs: drop dentry reference only when it is never used
fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
psi: clarify the Kconfig text for the default-disable option
mm, memory_hotplug: __offline_pages fix wrong locking
mm: hwpoison: use do_send_sig_info() instead of force_sig()
kasan: mark file common so ftrace doesn't trace it
init/Kconfig: fix grammar by moving a closing parenthesis
lib/test_kmod.c: potential double free in error handling
mm, oom: fix use-after-free in oom_kill_process
mm/hotplug: invalid PFNs from pfn_to_online_page()
mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
psi: fix aggregation idle shut-off
mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
oom, oom_reaper: do not enqueue same task twice
mm: migrate: make buffer_migrate_page_norefs() actually succeed
kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
x86_64: increase stack size for KASAN_EXTRA
...
because st->marker++ is is called after "UEFI runtime end" which is the
last element in addr_marker[]. Therefore, add a terminator like the one
for kernel_page_tables, so it can be skipped to print out non-existent
markers.
Here's the KASAN bug report:
# cat /sys/kernel/debug/efi_page_tables
---[ UEFI runtime start ]---
0x0000000020000000-0x0000000020010000 64K PTE RW NX SHD AF ...
0x0000000020200000-0x0000000021340000 17664K PTE RW NX SHD AF ...
...
0x0000000021920000-0x0000000021950000 192K PTE RW x SHD AF ...
0x0000000021950000-0x00000000219a0000 320K PTE RW NX SHD AF ...
---[ UEFI runtime end ]---
---[ (null) ]---
---[ (null) ]---
BUG: KASAN: global-out-of-bounds in note_page+0x1f0/0xac0
Read of size 8 at addr ffff2000123f2ac0 by task read_all/42464
Call trace:
dump_backtrace+0x0/0x298
show_stack+0x24/0x30
dump_stack+0xb0/0xdc
print_address_description+0x64/0x2b0
kasan_report+0x150/0x1a4
__asan_report_load8_noabort+0x30/0x3c
note_page+0x1f0/0xac0
walk_pgd+0xb4/0x244
ptdump_walk_pgd+0xec/0x140
ptdump_show+0x40/0x50
seq_read+0x3f8/0xad0
full_proxy_read+0x9c/0xc0
__vfs_read+0xfc/0x4c8
vfs_read+0xec/0x208
ksys_read+0xd0/0x15c
__arm64_sys_read+0x84/0x94
el0_svc_handler+0x258/0x304
el0_svc+0x8/0xc
The buggy address belongs to the variable:
__compound_literal.0+0x20/0x800
Memory state around the buggy address: ffff2000123f2980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff2000123f2a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa
>ffff2000123f2a80: fa fa fa fa 00 00 00 00 fa fa fa fa 00 00 00 00
^ ffff2000123f2b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff2000123f2b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
[ ardb: fix up whitespace ]
[ mingo: fix up some moar ]
Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Fixes: 9d80448ac92b ("efi/arm64: Add debugfs node to dump UEFI runtime page tables") Link: http://lkml.kernel.org/r/20190202095017.13799-2-ard.biesheuvel@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Johannes Weiner [Tue, 29 Jan 2019 22:44:36 +0000 (17:44 -0500)]
x86/resctrl: Avoid confusion over the new X86_RESCTRL config
"Resource Control" is a very broad term for this CPU feature, and a term
that is also associated with containers, cgroups etc. This can easily
cause confusion.
Make the user prompt more specific. Match the config symbol name.
[ bp: In the future, the corresponding ARM arch-specific code will be
under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
under the CPU_RESCTRL umbrella symbol. ]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Babu Moger <Babu.Moger@amd.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morse <james.morse@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: linux-doc@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pu Wen <puwen@hygon.cn> Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
Linus Torvalds [Sat, 2 Feb 2019 00:56:30 +0000 (16:56 -0800)]
Merge tag 'xtensa-20190201' of git://github.com/jcmvbkbc/linux-xtensa
Pull xtensa fixes from Max Filippov:
- fix ccount_timer_shutdown for secondary CPUs
- fix secondary CPU initialization
- fix secondary CPU reset vector clash with double exception vector
- fix present CPUs when booting with 'maxcpus' parameter
- limit possible CPUs by configured NR_CPUS
- issue a warning if xtensa PIC is asked to retrigger anything other
than software IRQ
- fix masking/unmasking of the first two IRQs on xtensa MX PIC
- fix typo in Kconfig description for user space unaligned access
feature
- fix Kconfig warning for selecting BUILTIN_DTB
* tag 'xtensa-20190201' of git://github.com/jcmvbkbc/linux-xtensa:
xtensa: SMP: limit number of possible CPUs by NR_CPUS
xtensa: rename BUILTIN_DTB to BUILTIN_DTB_SOURCE
xtensa: Fix typo use space=>user space
drivers/irqchip: xtensa-mx: fix mask and unmask
drivers/irqchip: xtensa: add warning to irq_retrigger
xtensa: SMP: mark each possible CPU as present
xtensa: smp_lx200_defconfig: fix vectors clash
xtensa: SMP: fix secondary CPU initialization
xtensa: SMP: fix ccount_timer_shutdown
Linus Torvalds [Sat, 2 Feb 2019 00:54:25 +0000 (16:54 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Although we're still debugging a few minor arm64-specific issues in
mainline, I didn't want to hold this lot up in the meantime.
We've got an additional KASLR fix after the previous one wasn't quite
complete, a fix for a performance regression when mapping executable
pages into userspace and some fixes for kprobe blacklisting. All
candidates for stable.
Summary:
- Fix module loading when KASLR is configured but disabled at runtime
- Fix accidental IPI when mapping user executable pages
- Ensure hyp-stub and KVM world switch code cannot be kprobed"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: hibernate: Clean the __hyp_text to PoC after resume
arm64: hyp-stub: Forbid kprobing of the hyp-stub
arm64: kprobe: Always blacklist the KVM world-switch code
arm64: kaslr: ensure randomized quantities are clean also when kaslr is off
arm64: Do not issue IPIs for user executable ptes
Linus Torvalds [Sat, 2 Feb 2019 00:53:01 +0000 (16:53 -0800)]
Merge tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fixes from Steve French:
"SMB3 fixes, some from this week's SMB3 test evemt, 5 for stable and a
particularly important one for queryxattr (see xfstests 70 and 117)"
* tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number
CIFS: fix use-after-free of the lease keys
CIFS: Do not consider -ENODATA as stat failure for reads
CIFS: Do not count -ENODATA as failure for query directory
CIFS: Fix trace command logging for SMB2 reads and writes
CIFS: Fix possible oops and memory leaks in async IO
cifs: limit amount of data we request for xattrs to CIFSMaxBufSize
cifs: fix computation for MAX_SMB2_HDR_SIZE
Linus Torvalds [Sat, 2 Feb 2019 00:18:38 +0000 (16:18 -0800)]
Merge tag 'apparmor-pr-2019-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
Pull apparmor bug fixes from John Johansen:
"Two bug fixes for apparmor:
- Fix aa_label_build() error handling for failed merges
- Fix warning about unused function apparmor_ipv6_postroute"
* tag 'apparmor-pr-2019-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
apparmor: Fix aa_label_build() error handling for failed merges
apparmor: Fix warning about unused function apparmor_ipv6_postroute
Pan Bian [Fri, 1 Feb 2019 22:21:26 +0000 (14:21 -0800)]
autofs: drop dentry reference only when it is never used
autofs_expire_run() calls dput(dentry) to drop the reference count of
dentry. However, dentry is read via autofs_dentry_ino(dentry) after
that. This may result in a use-free-bug. The patch drops the reference
count of dentry only when it is never used.
Jan Kara [Fri, 1 Feb 2019 22:21:23 +0000 (14:21 -0800)]
fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()
When superblock has lots of inodes without any pagecache (like is the
case for /proc), drop_pagecache_sb() will iterate through all of them
without dropping sb->s_inode_list_lock which can lead to softlockups
(one of our customers hit this).
Fix the problem by going to the slow path and doing cond_resched() in
case the process needs rescheduling.
Link: http://lkml.kernel.org/r/20190114085343.15011-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 1 Feb 2019 22:21:19 +0000 (14:21 -0800)]
mm: migrate: don't rely on __PageMovable() of newpage after unlocking it
We had a race in the old balloon compaction code before b1123ea6d3b3
("mm: balloon: use general non-lru movable page feature") refactored it
that became visible after backporting 195a8c43e93d ("virtio-balloon:
deflate via a page list") without the refactoring.
The bug existed from commit d6d86c0a7f8d ("mm/balloon_compaction:
redesign ballooned pages management") till b1123ea6d3b3 ("mm: balloon:
use general non-lru movable page feature"). d6d86c0a7f8d
("mm/balloon_compaction: redesign ballooned pages management") was
backported to 3.12, so the broken kernels are stable kernels [3.12 -
4.7].
There was a subtle race between dropping the page lock of the newpage in
__unmap_and_move() and checking for __is_movable_balloon_page(newpage).
Just after dropping this page lock, virtio-balloon could go ahead and
deflate the newpage, effectively dequeueing it and clearing PageBalloon,
in turn making __is_movable_balloon_page(newpage) fail.
This resulted in dropping the reference of the newpage via
putback_lru_page(newpage) instead of put_page(newpage), leading to
page->lru getting modified and a !LRU page ending up in the LRU lists.
With 195a8c43e93d ("virtio-balloon: deflate via a page list")
backported, one would suddenly get corrupted lists in
release_pages_balloon():
- WARNING: CPU: 13 PID: 6586 at lib/list_debug.c:59 __list_del_entry+0xa1/0xd0
- list_del corruption. prev->next should be ffffe253961090a0, but was dead000000000100
Nowadays this race is no longer possible, but it is hidden behind very
ugly handling of __ClearPageMovable() and __PageMovable().
__ClearPageMovable() will not make __PageMovable() fail, only
PageMovable(). So the new check (__PageMovable(newpage)) will still
hold even after newpage was dequeued by virtio-balloon.
If anybody would ever change that special handling, the BUG would be
introduced again. So instead, make it explicit and use the information
of the original isolated page before migration.
This patch can be backported fairly easy to stable kernels (in contrast
to the refactoring).
Link: http://lkml.kernel.org/r/20190129233217.10747-1-david@redhat.com Fixes: d6d86c0a7f8d ("mm/balloon_compaction: redesign ballooned pages management") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Vratislav Bendel <vbendel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vratislav Bendel <vbendel@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [3.12 - 4.7] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 1 Feb 2019 22:21:15 +0000 (14:21 -0800)]
psi: clarify the Kconfig text for the default-disable option
The current help text caused some confusion in online forums about
whether or not to default-enable or default-disable psi in vendor
kernels. This is because it doesn't communicate the reason for why we
made this setting configurable in the first place: that the overhead is
non-zero in an artificial scheduler stress test.
Since this isn't representative of real workloads, and the effect was
not measurable in scheduler-heavy real world applications such as the
webservers and memcache installations at Facebook, it's fair to point
out that this is a pretty cautious option to select.
Link: http://lkml.kernel.org/r/20190129233617.16767-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 1 Feb 2019 22:21:12 +0000 (14:21 -0800)]
mm, memory_hotplug: __offline_pages fix wrong locking
Jan has noticed that we do double unlock on some failure paths when
offlining a page range. This is indeed the case when
test_pages_in_a_zone respp. start_isolate_page_range fail. This was an
omission when forward porting the debugging patch from an older kernel.
Fix the issue by dropping mem_hotplug_done from the failure condition
and keeping the single unlock in the catch all failure path.
Link: http://lkml.kernel.org/r/20190115120307.22768-1-mhocko@kernel.org Fixes: 7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Jan Kara <jack@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Fri, 1 Feb 2019 22:21:08 +0000 (14:21 -0800)]
mm: hwpoison: use do_send_sig_info() instead of force_sig()
Currently memory_failure() is racy against process's exiting, which
results in kernel crash by null pointer dereference.
The root cause is that memory_failure() uses force_sig() to forcibly
kill asynchronous (meaning not in the current context) processes. As
discussed in thread https://lkml.org/lkml/2010/6/8/236 years ago for OOM
fixes, this is not a right thing to do. OOM solves this issue by using
do_send_sig_info() as done in commit d2d393099de2 ("signal:
oom_kill_task: use SEND_SIG_FORCED instead of force_sig()"), so this
patch is suggesting to do the same for hwpoison. do_send_sig_info()
properly accesses to siglock with lock_task_sighand(), so is free from
the reported race.
I confirmed that the reported bug reproduces with inserting some delay
in kill_procs(), and it never reproduces with this patch.
Note that memory_failure() can send another type of signal using
force_sig_mceerr(), and the reported race shouldn't happen on it because
force_sig_mceerr() is called only for synchronous processes (i.e.
BUS_MCEERR_AR happens only when some process accesses to the corrupted
memory.)
Link: http://lkml.kernel.org/r/20190116093046.GA29835@hori1.linux.bs1.fc.nec.co.jp Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Anders Roxell [Fri, 1 Feb 2019 22:21:05 +0000 (14:21 -0800)]
kasan: mark file common so ftrace doesn't trace it
When option CONFIG_KASAN is enabled toghether with ftrace, function
ftrace_graph_caller() gets in to a recursion, via functions
kasan_check_read() and kasan_check_write().
Breakpoint 2, ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
179 mcount_get_pc x0 // function's pc
(gdb) bt
#0 ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
#1 0xffffff90101406c8 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151
#2 0xffffff90106fd084 in kasan_check_write (p=0xffffffc06c170878, size=4) at ../mm/kasan/common.c:105
#3 0xffffff90104a2464 in atomic_add_return (v=<optimized out>, i=<optimized out>) at ./include/generated/atomic-instrumented.h:71
#4 atomic_inc_return (v=<optimized out>) at ./include/generated/atomic-fallback.h:284
#5 trace_graph_entry (trace=0xffffffc03f5ff380) at ../kernel/trace/trace_functions_graph.c:441
#6 0xffffff9010481774 in trace_graph_entry_watchdog (trace=<optimized out>) at ../kernel/trace/trace_selftest.c:741
#7 0xffffff90104a185c in function_graph_enter (ret=<optimized out>, func=<optimized out>, frame_pointer=18446743799894897728, retp=<optimized out>) at ../kernel/trace/trace_functions_graph.c:196
#8 0xffffff9010140628 in prepare_ftrace_return (self_addr=18446743592948977792, parent=0xffffffc03f5ff418, frame_pointer=18446743799894897728) at ../arch/arm64/kernel/ftrace.c:231
#9 0xffffff90101406f4 in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)
Rework so that the kasan implementation isn't traced.
Dan Carpenter [Fri, 1 Feb 2019 22:20:58 +0000 (14:20 -0800)]
lib/test_kmod.c: potential double free in error handling
There is a copy and paste bug so we set "config->test_driver" to NULL
twice instead of setting "config->test_fs". Smatch complains that it
leads to a double free:
Link: http://lkml.kernel.org/r/20190121140011.GA14283@kadam Fixes: d9c6a72d6fa2 ("kmod: add test driver to stress test the module loader") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Fri, 1 Feb 2019 22:20:54 +0000 (14:20 -0800)]
mm, oom: fix use-after-free in oom_kill_process
Syzbot instance running on upstream kernel found a use-after-free bug in
oom_kill_process. On further inspection it seems like the process
selected to be oom-killed has exited even before reaching
read_lock(&tasklist_lock) in oom_kill_process(). More specifically the
tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task()
and the put_task_struct within for_each_thread() frees the tsk and
for_each_thread() tries to access the tsk. The easiest fix is to do
get/put across the for_each_thread() on the selected task.
Now the next question is should we continue with the oom-kill as the
previously selected task has exited? However before adding more
complexity and heuristics, let's answer why we even look at the children
of oom-kill selected task? The select_bad_process() has already selected
the worst process in the system/memcg. Due to race, the selected
process might not be the worst at the kill time but does that matter?
The userspace can use the oom_score_adj interface to prefer children to
be killed before the parent. I looked at the history but it seems like
this is there before git history.
Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Fri, 1 Feb 2019 22:20:51 +0000 (14:20 -0800)]
mm/hotplug: invalid PFNs from pfn_to_online_page()
On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
with CONFIG_DEBUG_VM_PGFLAGS=y due to page_to_nid() found a pfn that is
not directly mapped (MEMBLOCK_NOMAP). Hence, the page->flags is
uninitialized.
This is due to the commit 9f1eb38e0e11 ("mm, kmemleak: little
optimization while scanning") starts to use pfn_to_online_page() instead
of pfn_valid(). However, in the CONFIG_MEMORY_HOTPLUG=y case,
pfn_to_online_page() does not call memblock_is_map_memory() while
pfn_valid() does.
Historically, the commit 68709f45385a ("arm64: only consider memblocks
with NOMAP cleared for linear mapping") causes pages marked as nomap
being no long reassigned to the new zone in memmap_init_zone() by
calling __init_single_page().
Since the commit 2d070eab2e82 ("mm: consider zone which is not fully
populated to have holes") introduced pfn_to_online_page() and was
designed to return a valid pfn only, but it is clearly broken on arm64.
Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
handle nomap thanks to the commit f52bb98f5ade ("arm64: mm: always
enable CONFIG_HOLES_IN_ZONE"), while it will be optimized away on
architectures where have no HOLES_IN_ZONE.
Oscar Salvador [Fri, 1 Feb 2019 22:20:47 +0000 (14:20 -0800)]
mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages
This is the same sort of error we saw in commit 17e2e7d7e1b8 ("mm,
page_alloc: fix has_unmovable_pages for HugePages").
Gigantic hugepages cross several memblocks, so it can be that the page
we get in scan_movable_pages() is a page-tail belonging to a
1G-hugepage. If that happens, page_hstate()->size_to_hstate() will
return NULL, and we will blow up in hugepage_migration_supported().
Johannes Weiner [Fri, 1 Feb 2019 22:20:42 +0000 (14:20 -0800)]
psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when
there is a period of no task activity - and thus no data that needs
aggregating. However, while developing psi monitoring, Suren noticed
that the aggregation clock currently won't stay shut off for good.
Debugging this revealed a flaw in the idle design: an aggregation run
will see no task activity and decide to go to sleep; shortly thereafter,
the kworker thread that executed the aggregation will go idle and cause
a scheduling change, during which the psi callback will kick the
!pending worker again. This will ping-pong forever, and is equivalent
to having no shut-off logic at all (but with more code!)
Fix this by exempting aggregation workers from psi's clock waking logic
when the state change is them going to sleep. To do this, tag workers
with the last work function they executed, and if in psi we see a worker
going to sleep after aggregating psi data, we will not reschedule the
aggregation work item.
What if the worker is also executing other items before or after?
Any psi state times that were incurred by work items preceding the
aggregation work will have been collected from the per-cpu buckets
during the aggregation itself. If there are work items following the
aggregation work, the worker's last_func tag will be overwritten and the
aggregator will be kept alive to process this genuine new activity.
If the aggregation work is the last thing the worker does, and we decide
to go idle, the brief period of non-idle time incurred between the
aggregation run and the kworker's dequeue will be stranded in the
per-cpu buckets until the clock is woken by later activity. But that
should not be a problem. The buckets can hold 4s worth of time, and
future activity will wake the clock with a 2s delay, giving us 2s worth
of data we can leave behind when disabling aggregation. If it takes a
worker more than two seconds to go idle after it finishes its last work
item, we likely have bigger problems in the system, and won't notice one
sample that was averaged with a bogus per-CPU weight.
Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mikhail Zaslonko [Fri, 1 Feb 2019 22:20:38 +0000 (14:20 -0800)]
mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone
If memory end is not aligned with the sparse memory section boundary,
the mapping of such a section is only partly initialized. This may lead
to VM_BUG_ON due to uninitialized struct pages access from
test_pages_in_a_zone() function triggered by memory_hotplug sysfs
handlers.
Here are the the panic examples:
CONFIG_DEBUG_VM_PGFLAGS=y
kernel parameter mem=2050M
--------------------------
page:000003d082008000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
test_pages_in_a_zone+0xde/0x160
show_valid_zones+0x5c/0x190
dev_attr_show+0x34/0x70
sysfs_kf_seq_show+0xc8/0x148
seq_read+0x204/0x480
__vfs_read+0x32/0x178
vfs_read+0x82/0x138
ksys_read+0x5a/0xb0
system_call+0xdc/0x2d8
Last Breaking-Event-Address:
test_pages_in_a_zone+0xde/0x160
Kernel panic - not syncing: Fatal exception: panic_on_oops
Fix this by checking whether the pfn to check is within the zone.
[mhocko@suse.com: separated this change from http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com] Link: http://lkml.kernel.org/r/20190128144506.15603-3-mhocko@kernel.org
[mhocko@suse.com: separated this change from
http://lkml.kernel.org/r/20181105150401.97287-2-zaslonko@linux.ibm.com] Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 1 Feb 2019 22:20:34 +0000 (14:20 -0800)]
mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone
Patch series "mm, memory_hotplug: fix uninitialized pages fallouts", v2.
Mikhail Zaslonko has posted fixes for the two bugs quite some time ago
[1]. I have pushed back on those fixes because I believed that it is
much better to plug the problem at the initialization time rather than
play whack-a-mole all over the hotplug code and find all the places
which expect the full memory section to be initialized.
We have ended up with commit 2830bf6f05fb ("mm, memory_hotplug:
initialize struct pages for the full memory section") merged and cause a
regression [2][3]. The reason is that there might be memory layouts
when two NUMA nodes share the same memory section so the merged fix is
simply incorrect.
In order to plug this hole we really have to be zone range aware in
those handlers. I have split up the original patch into two. One is
unchanged (patch 2) and I took a different approach for `removable'
crash.
Mikhail has reported the following VM_BUG_ON triggered when reading sysfs
removable state of a memory block:
page:000003d08300c000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
is_mem_section_removable+0xb4/0x190
show_mem_removable+0x9a/0xd8
dev_attr_show+0x34/0x70
sysfs_kf_seq_show+0xc8/0x148
seq_read+0x204/0x480
__vfs_read+0x32/0x178
vfs_read+0x82/0x138
ksys_read+0x5a/0xb0
system_call+0xdc/0x2d8
Last Breaking-Event-Address:
is_mem_section_removable+0xb4/0x190
Kernel panic - not syncing: Fatal exception: panic_on_oops
The reason is that the memory block spans the zone boundary and we are
stumbling over an unitialized struct page. Fix this by enforcing zone
range in is_mem_section_removable so that we never run away from a zone.
Link: http://lkml.kernel.org/r/20190128144506.15603-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Debugged-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 1 Feb 2019 22:20:31 +0000 (14:20 -0800)]
oom, oom_reaper: do not enqueue same task twice
Arkadiusz reported that enabling memcg's group oom killing causes
strange memcg statistics where there is no task in a memcg despite the
number of tasks in that memcg is not 0. It turned out that there is a
bug in wake_oom_reaper() which allows enqueuing same task twice which
makes impossible to decrease the number of tasks in that memcg due to a
refcount leak.
This bug existed since the OOM reaper became invokable from
task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
T1@P1 |T2@P1 |T3@P1 |OOM reaper
----------+----------+----------+------------
# Processing an OOM victim in a different memcg domain.
try_charge()
mem_cgroup_out_of_memory()
mutex_lock(&oom_lock)
try_charge()
mem_cgroup_out_of_memory()
mutex_lock(&oom_lock)
try_charge()
mem_cgroup_out_of_memory()
mutex_lock(&oom_lock)
out_of_memory()
oom_kill_process(P1)
do_send_sig_info(SIGKILL, @P1)
mark_oom_victim(T1@P1)
wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
mutex_unlock(&oom_lock)
out_of_memory()
mark_oom_victim(T2@P1)
wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
mutex_unlock(&oom_lock)
out_of_memory()
mark_oom_victim(T1@P1)
wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
mutex_unlock(&oom_lock)
# Completed processing an OOM victim in a different memcg domain.
spin_lock(&oom_reaper_lock)
# T1P1 is dequeued.
spin_unlock(&oom_reaper_lock)
but memcg's group oom killing made it easier to trigger this bug by
calling wake_oom_reaper() on the same task from one out_of_memory()
request.
Fix this bug using an approach used by commit 855b018325737f76 ("oom,
oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a
side effect of this patch, this patch also avoids enqueuing multiple
threads sharing memory via task_will_free_mem(current) path.
Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl> Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Fri, 1 Feb 2019 22:20:27 +0000 (14:20 -0800)]
mm: migrate: make buffer_migrate_page_norefs() actually succeed
Currently, buffer_migrate_page_norefs() was constantly failing because
buffer_migrate_lock_buffers() grabbed reference on each buffer. In
fact, there's no reason for buffer_migrate_lock_buffers() to grab any
buffer references as the page is locked during all our operation and
thus nobody can reclaim buffers from the page.
So remove grabbing of buffer references which also makes
buffer_migrate_page_norefs() succeed.
Link: http://lkml.kernel.org/r/20190116131217.7226-1-jack@suse.cz Fixes: 89cb0888ca14 "mm: migrate: provide buffer_migrate_page_norefs()" Signed-off-by: Jan Kara <jack@suse.cz> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrei Vagin [Fri, 1 Feb 2019 22:20:24 +0000 (14:20 -0800)]
kernel/exit.c: release ptraced tasks before zap_pid_ns_processes
Currently, exit_ptrace() adds all ptraced tasks in a dead list, then
zap_pid_ns_processes() waits on all tasks in a current pidns, and only
then are tasks from the dead list released.
zap_pid_ns_processes() can get stuck on waiting tasks from the dead
list. In this case, we will have one unkillable process with one or
more dead children.
Thanks to Oleg for the advice to release tasks in find_child_reaper().
Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()") Signed-off-by: Andrei Vagin <avagin@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Fri, 1 Feb 2019 22:20:20 +0000 (14:20 -0800)]
x86_64: increase stack size for KASAN_EXTRA
If the kernel is configured with KASAN_EXTRA, the stack size is
increasted significantly because this option sets "-fstack-reuse" to
"none" in GCC [1]. As a result, it triggers stack overrun quite often
with 32k stack size compiled using GCC 8. For example, this reproducer
triggers a "corrupted stack end detected inside scheduler" very reliably
with CONFIG_SCHED_STACK_END_CHECK enabled.
There are just too many functions that could have a large stack with
KASAN_EXTRA due to large local variables that have been called over and
over again without being able to reuse the stacks. Some noticiable ones
are
There are other 49 functions are over 2k in size while compiling kernel
with "-Wframe-larger-than=" even with a related minimal config on this
machine. Hence, it is too much work to change Makefiles for each object
to compile without "-fsanitize-address-use-after-scope" individually.
Although there is a patch in GCC 9 to help the situation, GCC 9 probably
won't be released in a few months and then it probably take another
6-month to 1-year for all major distros to include it as a default.
Hence, the stack usage with KASAN_EXTRA can be revisited again in 2020
when GCC 9 is everywhere. Until then, this patch will help users avoid
stack overrun.
This has already been fixed for arm64 for the same reason via 6e8830674ea ("arm64: kasan: Increase stack size for KASAN_EXTRA").
Link: http://lkml.kernel.org/r/20190109215209.2903-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Arcangeli [Fri, 1 Feb 2019 22:20:16 +0000 (14:20 -0800)]
mm/hugetlb.c: teach follow_hugetlb_page() to handle FOLL_NOWAIT
hugetlb needs the same fix as faultin_nopage (which was applied in
commit 96312e61282a ("mm/gup.c: teach get_user_pages_unlocked to handle
FOLL_NOWAIT")) or KVM hangs because it thinks the mmap_sem was already
released by hugetlb_fault() if it returned VM_FAULT_RETRY, but it wasn't
in the FOLL_NOWAIT case.
Link: http://lkml.kernel.org/r/20190109020203.26669-2-aarcange@redhat.com Fixes: ce53053ce378 ("kvm: switch get_user_page_nowait() to get_user_pages_unlocked()") Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The newly added riscv correctly creates the asm-generic wrapper
in the kernel space, but the others (c6x, h8300, hexagon, m68k,
microblaze, openrisc, unicore32) create the one in the uapi directory.
Digging into the git history, now I guess fcc8487d477a ("uapi:
export all headers under uapi directories") was the misconversion.
Prior to that commit, no architecture exported to shmparam.h
As its commit description said, that commit exported shmparam.h
for c6x, h8300, hexagon, m68k, openrisc, unicore32.
Oscar Salvador [Fri, 1 Feb 2019 22:19:57 +0000 (14:19 -0800)]
mm, memory_hotplug: don't bail out in do_migrate_range() prematurely
do_migrate_range() takes a memory range and tries to isolate the pages
to put them into a list. This list will be later on used in
migrate_pages() to know the pages we need to migrate.
Currently, if we fail to isolate a single page, we put all already
isolated pages back to their LRU and we bail out from the function.
This is quite suboptimal, as this will force us to start over again
because scan_movable_pages will give us the same range. If there is no
chance that we can isolate that page, we will loop here forever.
Issue debugged in [1] has proved that. During the debugging of that
issue, it was noticed that if do_migrate_ranges() fails to isolate a
single page, we will just discard the work we have done so far and bail
out, which means that scan_movable_pages() will find again the same set
of pages.
Instead, we can just skip the error, keep isolating as much pages as
possible and then proceed with the call to migrate_pages().
This will allow us to do as much work as possible at once.
[1] https://lkml.org/lkml/2018/12/6/324
Michal said:
: I still think that this doesn't give us a whole picture. Looping for
: ever is a bug. Failing the isolation is quite possible and it should
: be a ephemeral condition (e.g. a race with freeing the page or
: somebody else isolating the page for whatever reason). And here comes
: the disadvantage of the current implementation. We simply throw
: everything on the floor just because of a ephemeral condition. The
: racy page_count check is quite dubious to prevent from that.
Link: http://lkml.kernel.org/r/20181211135312.27034-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 1 Feb 2019 18:39:24 +0000 (10:39 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Still not much going on, the usual set of oops and driver fixes this
time:
- Fix two uapi breakage regressions in mlx5 drivers
- Various oops fixes in hfi1, mlx4, umem, uverbs, and ipoib
- A protocol bug fix for hfi1 preventing it from implementing the
verbs API properly, and a compatability fix for EXEC STACK user
programs
- Fix missed refcounting in the 'advise_mr' patches merged this
cycle.
- Fix wrong use of the uABI in the hns SRQ patches merged this cycle"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
IB/uverbs: Fix OOPs in uverbs_user_mmap_disassociate
IB/ipoib: Fix for use-after-free in ipoib_cm_tx_start
IB/uverbs: Fix ioctl query port to consider device disassociation
RDMA/mlx5: Fix flow creation on representors
IB/uverbs: Fix OOPs upon device disassociation
RDMA/umem: Add missing initialization of owning_mm
RDMA/hns: Update the kernel header file of hns
IB/mlx5: Fix how advise_mr() launches async work
RDMA/device: Expose ib_device_try_get(()
IB/hfi1: Add limit test for RC/UC send via loopback
IB/hfi1: Remove overly conservative VM_EXEC flag check
IB/{hfi1, qib}: Fix WC.byte_len calculation for UD_SEND_WITH_IMM
IB/mlx4: Fix using wrong function to destroy sqp AHs under SRIOV
RDMA/mlx5: Fix check for supported user flags when creating a QP
Linus Torvalds [Fri, 1 Feb 2019 18:30:18 +0000 (10:30 -0800)]
Merge tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fixes from Darrick Wong:
"A couple of iomap fixes to eliminate some memory corruption and hang
problems that were reported:
- fix page migration when using iomap for pagecache management
- fix a use-after-free bug in the directio code"
* tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: fix a use after free in iomap_dio_rw
iomap: get/put the page in iomap_page_create/release()
Linus Torvalds [Fri, 1 Feb 2019 18:23:39 +0000 (10:23 -0800)]
Merge tag 'pm-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a PM-runtime framework regression introduced by the recent
switch-over of device autosuspend to hrtimers and a mistake in the
"poll idle state" code introduced by a recent change in it.
Specifics:
- Since ktime_get() turns out to be problematic for device
autosuspend in the PM-runtime framework, make it use
ktime_get_mono_fast_ns() instead (Vincent Guittot).
- Fix an initial value of a local variable in the "poll idle state"
code that makes it behave not exactly as expected when all idle
states except for the "polling" one are disabled (Doug Smythies)"
* tag 'pm-5.0-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: poll_state: Fix default time limit
PM-runtime: Fix deadlock with ktime_get()