This is a fixup for mmotm patch
mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch
Link: https://lkml.kernel.org/r/7e9ccf34-57d1-786b-2dfd-3b9ba78e1b32@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:03 +0000 (09:59 +1000)]
mm, slub: convert kmem_cpu_slab protection to local_lock
Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions
of local_lock instead of plain local_irq_save/restore. On !PREEMPT_RT
that's equivalent, with better lockdep visibility. On PREEMPT_RT that
means better preemption.
However, the cost on PREEMPT_RT is the loss of lockless fast paths which
only work with cpu freelist. Those are designed to detect and recover
from being preempted by other conflicting operations (both fast and slow
path), but the slow path operations assume they cannot be preempted by a
fast path operation, which is guaranteed naturally with disabled irqs.
With local locks on PREEMPT_RT, the fast paths now also need to take the
local lock to avoid races.
In the allocation fastpath slab_alloc_node() we can just defer to the
slowpath __slab_alloc() which also works with cpu freelist, but under the
local lock. In the free fastpath do_slab_free() we have to add a new
local lock protected version of freeing to the cpu freelist, as the
existing slowpath only works with the page freelist.
Also update the comment about locking scheme in SLUB to reflect changes
done by this series.
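To illustrate, a minimal sketch of the resulting protection scheme (field
layout simplified, the free-pointer walk reduced to an offset-0 link, and
the slow-path helper below is hypothetical, not an upstream function):

  struct kmem_cache_cpu {
          void **freelist;        /* cpu freelist, consumed by the fast path */
          unsigned long tid;      /* globally unique transaction id */
          struct page *page;      /* the cpu slab */
          local_lock_t lock;      /* protects the fields above in slow paths */
  };

  /* hypothetical slow-path helper, for illustration only */
  static void *take_from_cpu_freelist(struct kmem_cache *s)
  {
          struct kmem_cache_cpu *c;
          unsigned long flags;
          void *object;

          /* local_irq_save() on !PREEMPT_RT, a per-cpu spinlock on PREEMPT_RT */
          local_lock_irqsave(&s->cpu_slab->lock, flags);
          c = this_cpu_ptr(s->cpu_slab);
          object = c->freelist;
          if (object)
                  c->freelist = *(void **)object;     /* simplified link walk */
          local_unlock_irqrestore(&s->cpu_slab->lock, flags);
          return object;
  }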
[efault@gmx.de: use local_lock() without irq in PREEMPT_RT scope, debugging of RT crashes resulting in put_cpu_partial() locking changes] Link: https://lkml.kernel.org/r/20210805152000.12817-36-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:02 +0000 (09:59 +1000)]
mm, slub: use migrate_disable() on PREEMPT_RT
We currently use preempt_disable() (directly or via get_cpu_ptr()) to
stabilize the pointer to kmem_cache_cpu. On PREEMPT_RT this would be
incompatible with the list_lock spinlock. We can use migrate_disable()
instead, but that increases overhead on !PREEMPT_RT as it's an
unconditional function call.
In order to get the best available mechanism on both PREEMPT_RT and
!PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr()
wrappers and use them.
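The wrappers boil down to roughly the following (a sketch; the exact
definitions live in mm/slub.c):

  #ifdef CONFIG_PREEMPT_RT
  /* migrate_disable() keeps us on this cpu and is safe under RT spinlocks */
  #define slub_get_cpu_ptr(var)           \
  ({                                      \
          migrate_disable();              \
          this_cpu_ptr(var);              \
  })
  #define slub_put_cpu_ptr(var)           \
  do {                                    \
          (void)(var);                    \
          migrate_enable();               \
  } while (0)
  #else
  /* on !PREEMPT_RT keep the cheaper preempt_disable() based variants */
  #define slub_get_cpu_ptr(var)   get_cpu_ptr(var)
  #define slub_put_cpu_ptr(var)   put_cpu_ptr(var)
  #endif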
Link: https://lkml.kernel.org/r/20210805152000.12817-35-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:02 +0000 (09:59 +1000)]
mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
Jann Horn reported [1] the following theoretically possible race:
task A: put_cpu_partial() calls preempt_disable()
task A: oldpage = this_cpu_read(s->cpu_slab->partial)
interrupt: kfree() reaches unfreeze_partials() and discards the page
task B (on another CPU): reallocates page as page cache
task A: reads page->pages and page->pobjects, which are actually
halves of the pointer page->lru.prev
task B (on another CPU): frees page
interrupt: allocates page as SLUB page and places it on the percpu partial list
task A: this_cpu_cmpxchg() succeeds
which would cause page->pages and page->pobjects to end up containing
halves of pointers that would then influence when put_cpu_partial()
happens and show up in root-only sysfs files. Maybe that's acceptable,
I don't know. But there should probably at least be a comment for now
to point out that we're reading union fields of a page that might be
in a completely different state.
Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
latter disables irqs, otherwise a __slab_free() in an irq handler could
call put_cpu_partial() in the middle of ___slab_alloc() manipulating
->partial and corrupt it. This becomes an issue on RT after a local_lock
is introduced in later patch. The fix means taking the local_lock also in
put_cpu_partial() on RT.
After debugging this issue, Mike Galbraith suggested [2] that to avoid
different locking schemes on RT and !RT, we can just protect
put_cpu_partial() with disabled irqs (to be converted to
local_lock_irqsave() later) everywhere. This should be acceptable as it's
not a fast path, and moving the actual partial unfreezing outside of the
irq disabled section makes it short, and with the retry loop gone the code
can be also simplified. In addition, the race reported by Jann should no
longer be possible.
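Roughly, the restructured put_cpu_partial() looks like the sketch below
(based on the description above, not the verbatim patch; CONFIG_SLUB_CPU_PARTIAL
guards and statistics are omitted):

  static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
  {
          struct page *oldpage, *page_to_unfreeze = NULL;
          unsigned long flags;

          local_irq_save(flags);  /* later converted to local_lock_irqsave() */

          oldpage = this_cpu_read(s->cpu_slab->partial);
          if (drain && oldpage && oldpage->pobjects >= slub_cpu_partial(s)) {
                  /* too many cached objects: unfreeze, but only once irqs are on */
                  page_to_unfreeze = oldpage;
                  oldpage = NULL;
          }

          page->next = oldpage;
          page->pages = (oldpage ? oldpage->pages : 0) + 1;
          page->pobjects = (oldpage ? oldpage->pobjects : 0)
                           + page->objects - page->inuse;
          this_cpu_write(s->cpu_slab->partial, page);

          local_irq_restore(flags);

          if (page_to_unfreeze)
                  __unfreeze_partials(s, page_to_unfreeze);
  }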
Vlastimil Babka [Mon, 23 Aug 2021 23:59:02 +0000 (09:59 +1000)]
mm, slub: make slab_lock() disable irqs with PREEMPT_RT
We need to disable irqs around slab_lock() (a bit spinlock) to make it
irq-safe. The calls to slab_lock() are nested under spin_lock_irqsave()
which doesn't disable irqs on PREEMPT_RT, so add explicit disabling with
PREEMPT_RT.
We also distinguish cmpxchg_double_slab() where we do the disabling
explicitly and __cmpxchg_double_slab() for contexts with already disabled
irqs. However these contexts also typically use spin_lock_irqsave(), which
is insufficient on PREEMPT_RT. Thus, change __cmpxchg_double_slab() to be
same as cmpxchg_double_slab() on PREEMPT_RT.
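A sketch of the resulting slab_lock()/slab_unlock() behavior (simplified;
the real helpers route through an internal version that takes a bool, per
the previous patch):

  static inline void slab_lock(struct page *page, unsigned long *flags)
  {
          /* on PREEMPT_RT the callers' spin_lock_irqsave() leaves irqs enabled */
          if (IS_ENABLED(CONFIG_PREEMPT_RT))
                  local_irq_save(*flags);
          bit_spin_lock(PG_locked, &page->flags);
  }

  static inline void slab_unlock(struct page *page, unsigned long *flags)
  {
          __bit_spin_unlock(PG_locked, &page->flags);
          if (IS_ENABLED(CONFIG_PREEMPT_RT))
                  local_irq_restore(*flags);
  }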
Link: https://lkml.kernel.org/r/20210805152000.12817-33-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:02 +0000 (09:59 +1000)]
mm, slub: optionally save/restore irqs in slab_[un]lock()/
For PREEMPT_RT we will need to disable irqs for this bit spinlock. As a
preparation, add a flags parameter, and an internal version that takes
additional bool parameter to control irq saving/restoring (the flags
parameter is compile-time unused if the bool is a constant false).
Convert ___cmpxchg_double_slab(), which also comes with the same bool
parameter, to use the internal version.
Link: https://lkml.kernel.org/r/20210805152000.12817-32-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:01 +0000 (09:59 +1000)]
mm, slub: fix memory and cpu hotplug related lock ordering issues
Qian Cai reported [1] a lockdep splat on memory offline.
[ 91.374541] WARNING: possible circular locking dependency detected
[ 91.381411] 5.14.0-rc5-next-20210809+ #84 Not tainted
[ 91.387149] ------------------------------------------------------
[ 91.394016] lsbug/1523 is trying to acquire lock:
[ 91.399406] ffff800018e76530 (flush_lock){+.+.}-{3:3}, at: flush_all+0x50/0x1c8
[ 91.407425] but task is already holding lock:
[ 91.414638] ffff800018e48468 (slab_mutex){+.+.}-{3:3}, at: slab_memory_callback+0x44/0x280
[ 91.423603] which lock already depends on the new lock.
To fix it, we need to change the order in flush_all() so that
cpus_read_lock() is first and mutex_lock(&flush_lock) second.
Also when called from slab_mem_going_offline_callback() we are already
under cpus_read_lock() and cannot take it again, so create a
flush_all_cpus_locked() variant and decouple flushing from actual
shrinking for this call path.
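Under the assumption that flush_lock is the mutex shown in the splat above,
the fixed ordering is roughly:

  static DEFINE_MUTEX(flush_lock);

  /* caller must already hold cpus_read_lock() */
  static void flush_all_cpus_locked(struct kmem_cache *s)
  {
          lockdep_assert_cpus_held();
          mutex_lock(&flush_lock);     /* now always nested inside cpus_read_lock() */
          /* ... queue the per-cpu flush works and wait for them ... */
          mutex_unlock(&flush_lock);
  }

  static void flush_all(struct kmem_cache *s)
  {
          cpus_read_lock();
          flush_all_cpus_locked(s);
          cpus_read_unlock();
  }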
Additionally, Mike Galbraith reported [2] wrong order of cpus_read_lock()
and slab_mutex in kmem_cache_destroy() path and proposed a fix to reverse
it.
This patch is a fixup for the mmotm patch
mm-slub-move-flush_cpu_slab-invocations-__free_slab-invocations-out-of-irq-context.patch
Sebastian Andrzej Siewior [Mon, 23 Aug 2021 23:59:01 +0000 (09:59 +1000)]
mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context
flush_all() flushes a specific SLAB cache on each CPU (where the cache is
present). The deactivate_slab()/__free_slab() invocation happens within
IPI handler and is problematic for PREEMPT_RT.
The flush operation is not a frequent operation or a hot path. The
per-CPU flush operation can be moved to within a workqueue.
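A sketch of the per-cpu work based flushing (simplified; the real code also
skips cpus with nothing to flush and serializes callers with a mutex):

  struct slub_flush_work {
          struct work_struct work;
          struct kmem_cache *s;
          bool skip;
  };
  static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);

  /* runs in process context on the target cpu instead of in an IPI handler */
  static void flush_cpu_slab(struct work_struct *w)
  {
          struct slub_flush_work *sfw = container_of(w, struct slub_flush_work, work);
          struct kmem_cache *s = sfw->s;
          struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);

          if (c->page)
                  flush_slab(s, c, true); /* 'true': disable irqs internally */

          unfreeze_partials(s);
  }

flush_all() then queues this work on each online cpu with schedule_work_on()
and waits for completion with flush_work().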
[vbabka@suse.cz: adapt to new SLUB changes] Link: https://lkml.kernel.org/r/20210805152000.12817-30-vbabka@suse.cz Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:01 +0000 (09:59 +1000)]
mm, slab: make flush_slab() possible to call with irqs enabled
Currently flush_slab() is always called with disabled IRQs if it's needed,
but the following patches will change that, so add a parameter to control
IRQ disabling within the function, which only protects the kmem_cache_cpu
manipulation and not the call to deactivate_slab() which doesn't need it.
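The resulting function is roughly the following sketch (assuming
deactivate_slab() already takes the page and freelist directly, per the
earlier patches in the series):

  static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c,
                                bool lock)
  {
          unsigned long flags;
          struct page *page;
          void *freelist;

          if (lock)
                  local_irq_save(flags);

          /* only the kmem_cache_cpu manipulation needs irqs off */
          page = c->page;
          freelist = c->freelist;
          c->page = NULL;
          c->freelist = NULL;
          c->tid = next_tid(c->tid);

          if (lock)
                  local_irq_restore(flags);

          if (page)
                  deactivate_slab(s, page, freelist);

          stat(s, CPUSLAB_FLUSH);
  }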
Link: https://lkml.kernel.org/r/20210805152000.12817-29-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:01 +0000 (09:59 +1000)]
mm, slub: don't disable irqs in slub_cpu_dead()
slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls
only functions that are now irq safe, so we don't need to disable irqs
anymore.
Link: https://lkml.kernel.org/r/20210805152000.12817-28-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:00 +0000 (09:59 +1000)]
mm, slub: only disable irq with spin_lock in __unfreeze_partials()
__unfreeze_partials() no longer needs to have irqs disabled, except for
making the spin_lock operations irq-safe, so convert the spin_lock
operations and remove the separate irq handling.
Link: https://lkml.kernel.org/r/20210805152000.12817-27-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:00 +0000 (09:59 +1000)]
mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing
Unfreezing partial list can be split to two phases - detaching the list
from struct kmem_cache_cpu, and processing the list. The whole operation
does not need to be protected by disabled irqs. Restructure the code to
separate the detaching (with disabled irqs) and unfreezing (with irq
disabling to be reduced in the next patch).
Also, unfreeze_partials() can be called from another cpu on behalf of a
cpu that is being offlined, where disabling irqs on the local cpu makes no
sense, so restructure the code as follows (a sketch of the result follows
the list):
- __unfreeze_partials() is the bulk of unfreeze_partials() that processes the
detached percpu partial list
- unfreeze_partials() detaches list from current cpu with irqs disabled and
calls __unfreeze_partials()
- unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no
irq disabling, and is called from __flush_cpu_slab()
- flush_cpu_slab() is for the local cpu thus it needs to call
unfreeze_partials(). So it can't simply call
__flush_cpu_slab(smp_processor_id()) anymore and we have to open-code the
proper calls.
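A rough sketch of the split (simplified; error handling and
CONFIG_SLUB_CPU_PARTIAL guards omitted):

  /* detach the percpu partial list from the local cpu with irqs disabled */
  static void unfreeze_partials(struct kmem_cache *s)
  {
          struct page *partial_page;
          unsigned long flags;

          local_irq_save(flags);
          partial_page = this_cpu_read(s->cpu_slab->partial);
          this_cpu_write(s->cpu_slab->partial, NULL);
          local_irq_restore(flags);

          if (partial_page)
                  __unfreeze_partials(s, partial_page);
  }

  /* called for an offlined cpu from another cpu, so no irq disabling needed */
  static void unfreeze_partials_cpu(struct kmem_cache *s, struct kmem_cache_cpu *c)
  {
          struct page *partial_page = slub_percpu_partial(c);

          c->partial = NULL;
          if (partial_page)
                  __unfreeze_partials(s, partial_page);
  }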
Link: https://lkml.kernel.org/r/20210805152000.12817-26-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:59:00 +0000 (09:59 +1000)]
mm, slub: detach whole partial list at once in unfreeze_partials()
Instead of iterating through the live percpu partial list, detach it from
the kmem_cache_cpu at once. This is simpler and will allow further
optimization.
Link: https://lkml.kernel.org/r/20210805152000.12817-25-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:59 +0000 (09:58 +1000)]
mm, slub: move irq control into unfreeze_partials()
unfreeze_partials() can be optimized so that it doesn't need irqs disabled
for the whole time. As the first step, move irq control into the function
and remove it from the put_cpu_partial() caller.
Link: https://lkml.kernel.org/r/20210805152000.12817-23-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:59 +0000 (09:58 +1000)]
mm, slub: call deactivate_slab() without disabling irqs
The function is now safe to be called with irqs enabled, so move the calls
outside of irq disabled sections.
When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so
to reenable them before deactivate_slab() we need to open-code
flush_slab() in ___slab_alloc() and reenable irqs after modifying the
kmem_cache_cpu fields. But that means an IRQ handler meanwhile might have
assigned a new page to kmem_cache_cpu.page so we have to retry the whole
check.
The remaining callers of flush_slab() are the IPI handler which has
disabled irqs anyway, and slub_cpu_dead() which will be dealt with in the
following patch.
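The open-coded retry inside ___slab_alloc() looks roughly like this
fragment (a sketch; the label and variable names are illustrative and the
surrounding declarations are omitted):

  retry_load_slab:
          local_irq_save(flags);
          if (unlikely(c->page)) {
                  /* an irq handler installed a new cpu slab while irqs were on */
                  void *flush_freelist = c->freelist;
                  struct page *flush_page = c->page;

                  c->page = NULL;
                  c->freelist = NULL;
                  c->tid = next_tid(c->tid);

                  local_irq_restore(flags);

                  deactivate_slab(s, flush_page, flush_freelist);

                  goto retry_load_slab;
          }
          c->page = page;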
Link: https://lkml.kernel.org/r/20210805152000.12817-22-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:59 +0000 (09:58 +1000)]
mm, slub: make locking in deactivate_slab() irq-safe
deactivate_slab() no longer touches the kmem_cache_cpu structure, so it
will be possible to call it with irqs enabled. Just convert the spin_lock
calls to their irq saving/restoring variants to make it irq-safe.
Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(),
because in some situations we don't take the list_lock, which would
disable irqs.
Link: https://lkml.kernel.org/r/20210805152000.12817-21-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:58 +0000 (09:58 +1000)]
mm, slub: move reset of c->page and freelist out of deactivate_slab()
deactivate_slab() removes the cpu slab by merging the cpu freelist with
slab's freelist and putting the slab on the proper node's list. It also
sets the respective kmem_cache_cpu pointers to NULL.
By extracting the kmem_cache_cpu operations from the function, we can make
it not dependent on disabled irqs.
Also if we return a single free pointer from ___slab_alloc, we no longer
have to assign kmem_cache_cpu.page before deactivation or care if somebody
preempted us and assigned a different page to our kmem_cache_cpu in the
process.
Link: https://lkml.kernel.org/r/20210805152000.12817-20-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:58 +0000 (09:58 +1000)]
mm, slub: stop disabling irqs around get_partial()
The function get_partial() does not need to have irqs disabled as a whole.
It's sufficient to convert spin_lock operations to their irq
saving/restoring versions.
As a result, it's now possible to reach the page allocator from the slab
allocator without disabling and re-enabling interrupts on the way.
Link: https://lkml.kernel.org/r/20210805152000.12817-19-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:58 +0000 (09:58 +1000)]
mm, slub: check new pages with restored irqs
Building on top of the previous patch, re-enable irqs before checking new
pages. alloc_debug_processing() is now called with enabled irqs so we
need to remove VM_BUG_ON(!irqs_disabled()); in check_slab() - there
doesn't seem to be a need for it anyway.
Link: https://lkml.kernel.org/r/20210805152000.12817-18-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:58 +0000 (09:58 +1000)]
mm, slub: validate slab from partial list or page allocator before making it cpu slab
When we obtain a new slab page from node partial list or page allocator,
we assign it to kmem_cache_cpu, perform some checks, and if they fail, we
undo the assignment.
In order to allow doing the checks without irq disabled, restructure the
code so that the checks are done first, and kmem_cache_cpu.page assignment
only after they pass.
Link: https://lkml.kernel.org/r/20210805152000.12817-17-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:57 +0000 (09:58 +1000)]
mm, slub: restore irqs around calling new_slab()
allocate_slab() currently re-enables irqs before calling to the page
allocator. It depends on gfpflags_allow_blocking() to determine if it's
safe to do so. Now we can instead simply restore irq before calling it
through new_slab(). The other caller early_kmem_cache_node_alloc() is
unaffected by this.
Link: https://lkml.kernel.org/r/20210805152000.12817-16-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:57 +0000 (09:58 +1000)]
mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()
Continue reducing the irq disabled scope. Check for per-cpu partial slabs
first with irqs enabled and then recheck with irqs disabled before
grabbing the slab page. Mostly preparatory for the following patches.
Link: https://lkml.kernel.org/r/20210805152000.12817-15-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lkml.kernel.org/r/ec98bce0-fef4-0fbc-2067-e358510e0321@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Clark Williams <williams@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
The problem is that we are opportunistically checking flags on a page in an
irq enabled section. If we are interrupted and the page is freed, it's
not an issue as we detect it after disabling irqs. But on kernels with
CONFIG_DEBUG_VM, the check for the PageSlab flag in PageSlabPfmemalloc() can
fail.
Fix this by creating an "unsafe" version of the check that doesn't check
PageSlab.
Link: https://lkml.kernel.org/r/f4756ee5-a7e9-ab02-3aba-1355f77b7c79@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:56 +0000 (09:58 +1000)]
mm, slub: do initial checks in ___slab_alloc() with irqs enabled
As another step of shortening irq disabled sections in ___slab_alloc(),
delay disabling irqs until we pass the initial checks if there is a cached
percpu slab and it's suitable for our allocation.
Now we have to recheck c->page after actually disabling irqs as an
allocation in irq handler might have replaced it.
Because we call pfmemalloc_match() as one of the checks, we might hit
VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc in case we get
interrupted and the page is freed. Thus introduce a
pfmemalloc_match_unsafe() variant that lacks the PageSlab check.
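The unsafe variant is roughly the sketch below (assuming a page-flags helper
that reads the pfmemalloc bit without asserting PageSlab is added as part of
this change):

  /*
   * A variant of pfmemalloc_match() to be called when the page may have been
   * freed and reallocated under us, so PageSlab() can no longer be asserted.
   */
  static inline bool pfmemalloc_match_unsafe(struct page *page, gfp_t gfpflags)
  {
          if (unlikely(__PageSlabPfmemalloc(page)))
                  return gfp_pfmemalloc_allowed(gfpflags);

          return true;
  }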
Link: https://lkml.kernel.org/r/20210805152000.12817-14-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:56 +0000 (09:58 +1000)]
mm, slub: move disabling/enabling irqs to ___slab_alloc()
Currently __slab_alloc() disables irqs around the whole ___slab_alloc().
This includes cases where this is not needed, such as when the allocation
ends up in the page allocator and has to awkwardly enable irqs back based
on gfp flags. Also the whole kmem_cache_alloc_bulk() is executed with
irqs disabled even when it hits the __slab_alloc() slow path, and long
periods with disabled interrupts are undesirable.
As a first step towards reducing irq disabled periods, move irq handling
into ___slab_alloc(). Callers will instead prevent the s->cpu_slab percpu
pointer from becoming invalid via get_cpu_ptr(), thus preempt_disable().
This does not protect against modification by an irq handler, which is
still done with disabled irqs for most of ___slab_alloc(). As a small
immediate benefit, slab_out_of_memory() from ___slab_alloc() is now called
with irqs enabled.
kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables
them before calling ___slab_alloc(), which then disables them at its
discretion. The whole kmem_cache_alloc_bulk() operation also disables
preemption.
When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
preemption, because new_slab() will re-enable interrupts in contexts that
allow blocking (this will be improved by later patches).
The patch itself will thus increase overhead a bit due to disabled
preemption (on configs where it matters) and increased disabling/enabling
irqs in kmem_cache_alloc_bulk(), but that will be gradually improved in
the following patches.
Note in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard
to CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly
paired in all configurations. On configs without involuntary preemption
and debugging the re-read of kmem_cache_cpu pointer is still compiled out
as it was before.
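After the move, the __slab_alloc() wrapper is roughly (a sketch, not the
verbatim patch):

  static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                            unsigned long addr, struct kmem_cache_cpu *c)
  {
          void *p;

  #ifdef CONFIG_PREEMPT_COUNT
          /*
           * We may have been preempted and rescheduled on a different cpu
           * before disabling preemption. Need to reload the cpu area.
           */
          c = get_cpu_ptr(s->cpu_slab);
  #endif
          p = ___slab_alloc(s, gfpflags, node, addr, c);
  #ifdef CONFIG_PREEMPT_COUNT
          put_cpu_ptr(s->cpu_slab);
  #endif
          return p;
  }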
[efault@gmx.de: fix kmem_cache_alloc_bulk() error path] Link: https://lkml.kernel.org/r/20210805152000.12817-13-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:56 +0000 (09:58 +1000)]
mm, slub: simplify kmem_cache_cpu and tid setup
In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee
that our kmem_cache_cpu pointer is from the same cpu as the tid value.
Currently that's done by reading the tid first using this_cpu_read(), then
the kmem_cache_cpu pointer and verifying we read the same tid using the
pointer and plain READ_ONCE().
This can be simplified to just fetching kmem_cache_cpu pointer and then
reading tid using the pointer. That guarantees they are from the same
cpu. We don't need to read the tid using this_cpu_read() because the
value will be validated by this_cpu_cmpxchg_double(), making sure we are
on the correct cpu and the freelist didn't change by anyone preempting us
since reading the tid.
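In other words, the fastpath setup becomes roughly:

  /* any cpu's kmem_cache_cpu works: this_cpu_cmpxchg_double() validates the tid */
  c = raw_cpu_ptr(s->cpu_slab);
  tid = READ_ONCE(c->tid);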
Link: https://lkml.kernel.org/r/20210805152000.12817-12-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: restructure new page checks in ___slab_alloc()
When we allocate slab object from a newly acquired page (from node's
partial list or page allocator), we usually also retain the page as a new
percpu slab. There are two exceptions - when pfmemalloc status of the
page doesn't match our gfp flags, or when the cache has debugging enabled.
The current code for these decisions is not easy to follow, so restructure
it and add comments. The new structure will also help with the following
changes. No functional change.
Link: https://lkml.kernel.org/r/20210805152000.12817-11-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: return slab page from get_partial() and set c->page afterwards
The function get_partial() finds a suitable page on a partial list,
acquires and returns its freelist and assigns the page pointer to
kmem_cache_cpu. In later patch we will need more control over the
kmem_cache_cpu.page assignment, so instead of passing a kmem_cache_cpu
pointer, pass a pointer to a pointer to a page that get_partial() can fill
and the caller can assign the kmem_cache_cpu.page pointer. No functional
change as all of this still happens with disabled IRQs.
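The changed signature is roughly (a sketch of the prototype; the acquired
object freelist remains the return value):

  /* on success, returns the acquired freelist and stores the page in *ret_page */
  static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
                           struct page **ret_page);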
Link: https://lkml.kernel.org/r/20210805152000.12817-10-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: dissolve new_slab_objects() into ___slab_alloc()
The later patches will need more fine grained control over individual
actions in ___slab_alloc(), the only caller of new_slab_objects(), so
dissolve it there. This is a preparatory step with no functional change.
The only minor change is moving WARN_ON_ONCE() for using a constructor
together with __GFP_ZERO to new_slab(), which makes it somewhat less
frequent, but still able to catch a development change introducing a
systematic misuse.
Link: https://lkml.kernel.org/r/20210805152000.12817-9-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: extract get_partial() from new_slab_objects()
The later patches will need more fine grained control over individual
actions in ___slab_alloc(), the only caller of new_slab_objects(), so this
is a first preparatory step with no functional change.
This adds a goto label that appears unnecessary at this point, but will be
useful for later changes.
Link: https://lkml.kernel.org/r/20210805152000.12817-8-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:55 +0000 (09:58 +1000)]
mm, slub: unify cmpxchg_double_slab() and __cmpxchg_double_slab()
These functions differ only in irq disabling in the slow path. We can
create a common function with an extra bool parameter to control the irq
disabling. As the functions are inline and the parameter compile-time
constant, there will be no runtime overhead due to this change.
Also change the DEBUG_VM based irqs disable assert to the more standard
lockdep_assert based one.
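The common function is roughly the following sketch, showing only the
slab_lock() fallback path (the cmpxchg_double hardware path is unchanged and
omitted):

  static inline bool ___cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
                  void *freelist_old, unsigned long counters_old,
                  void *freelist_new, unsigned long counters_new,
                  const char *n, bool disable_irqs)
  {
          /* n: caller name, only used for debug printing in the real code */
          unsigned long flags;
          bool ret = false;

          if (!disable_irqs)
                  lockdep_assert_irqs_disabled();     /* replaces the VM_BUG_ON */
          else
                  local_irq_save(flags);

          slab_lock(page);
          if (page->freelist == freelist_old && page->counters == counters_old) {
                  page->freelist = freelist_new;
                  page->counters = counters_new;
                  ret = true;
          }
          slab_unlock(page);

          if (disable_irqs)
                  local_irq_restore(flags);

          return ret;
  }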
Link: https://lkml.kernel.org/r/20210805152000.12817-7-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs
immediately") introduced cpu partial flushing for kmemcg caches, based on
setting the target cpu_partial to 0 and adding a flushing check in
put_cpu_partial(). This code that sets cpu_partial to 0 was later moved
by c9fc586403e7 ("slab: introduce __kmemcg_cache_deactivate()") and
ultimately removed by 9855609bde03 ("mm: memcg/slab: use a single set of
kmem_caches for all accounted allocations"). However the check and flush
in put_cpu_partial() was never removed, although it's effectively a dead
code. So this patch removes it.
Note that d6e0b7fa1186 also added preempt_disable()/enable() to
unfreeze_partials() which could be thus also considered unnecessary. But
further patches will rely on it, so keep it.
Link: https://lkml.kernel.org/r/20210805152000.12817-6-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: don't disable irq for debug_check_no_locks_freed()
In slab_free_hook() we disable irqs around the
debug_check_no_locks_freed() call, which is unnecessary, as irqs are
already being disabled inside the call. This seems to be leftover from
the past where there were more calls inside the irq disabled sections.
Remove the irq disable/enable operations.
Mel noted:
> Looks like it was needed for kmemcheck which went away back in 4.15
Link: https://lkml.kernel.org/r/20210805152000.12817-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: allocate private object map for validate_slab_cache()
validate_slab_cache() is called either to handle a sysfs write, or from a
self-test context. In both situations it's straightforward to preallocate
a private object bitmap instead of grabbing the shared static one meant
for critical sections, so let's do that.
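A sketch of the preallocation (the validation helper called below is
hypothetical; it stands in for the existing per-node validation loop):

  static long validate_slab_cache(struct kmem_cache *s)
  {
          unsigned long *obj_map;
          long count;

          /* sysfs write / self-test context, so plain GFP_KERNEL is fine */
          obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
          if (!obj_map)
                  return -ENOMEM;

          count = validate_all_nodes(s, obj_map);     /* hypothetical helper */

          bitmap_free(obj_map);
          return count;
  }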
Link: https://lkml.kernel.org/r/20210805152000.12817-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: allocate private object map for debugfs listings
Slub has a static spinlock protected bitmap for marking which objects are
on freelist when it wants to list them, for situations where dynamically
allocating such map can lead to recursion or locking issues, and on-stack
bitmap would be too large.
The handlers of debugfs files alloc_traces and free_traces also currently
use this shared bitmap, but their syscall context makes it straightforward
to allocate a private map before entering locked sections, so switch these
processing paths to use a private bitmap.
Link: https://lkml.kernel.org/r/20210805152000.12817-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vlastimil Babka [Mon, 23 Aug 2021 23:58:54 +0000 (09:58 +1000)]
mm, slub: don't call flush_all() from slab_debug_trace_open()
Patch series "SLUB: reduce irq disabled scope and make it RT compatible", v4.
This series was initially inspired by Mel's pcplist local_lock rewrite,
and also interest to better understand SLUB's locking and the new
primitives and RT variants and implications. It should make SLUB more
preemption-friendly, especially for RT, hopefully without noticeable
regressions, as the fast paths are not affected.
The RFC/v1 version got basic performance screening by Mel that didn't show
major regressions. Mike's testing with hackbench of v2 on !RT reported
negligible differences [6].
RT configs showed some throughput regressions, but that's expected
tradeoff for the preemption improvements through the RT mutex. It didn't
prevent the v2 to be incorporated to the 5.13 RT tree [7], leading to
testing exposure and bugfixes.
Before the series, SLUB is lockless in both allocation and free fast
paths, but elsewhere, it's disabling irqs for considerable periods of time
- especially in allocation slowpath and the bulk allocation, where IRQs
are re-enabled only when a new page from the page allocator is needed, and
the context allows blocking. The irq disabled sections can then include
deactivate_slab() which walks a full freelist and frees the slab back to
page allocator or unfreeze_partials() going through a list of percpu
partial slabs. The RT tree currently has some patches mitigating these,
but we can do much better in mainline too.
Patches 1-6 are straightforward improvements or cleanups that could exist
outside of this series too, but are prerequisites.
Patches 7-10 are also preparatory code changes without functional changes,
but not so useful without the rest of the series.
Patch 11 simplifies the fast paths on systems with preemption, based on
(hopefully correct) observation that the current loops to verify tid are
unnecessary.
Patches 12-21 focus on reducing irq disabled scope in the allocation
slowpath.
Patch 12 moves disabling of irqs into ___slab_alloc() from its callers,
which are the allocation slowpath, and bulk allocation. Instead these
callers only disable preemption to stabilize the cpu. The following
patches then gradually reduce the scope of disabled irqs in
___slab_alloc() and the functions called from there. As of patch 15, the
re-enabling of irqs based on gfp flags before calling the page allocator
is removed from allocate_slab(). As of patch 18, it's possible to reach
the page allocator (in case the existing slabs are depleted) without
disabling and re-enabling irqs even a single time.
Patches 22-27 reduce the scope of disabled irqs in functions related to
unfreezing percpu partial slabs.
Patch 28 is preparatory. Patch 29 is adopted from the RT tree and
converts the flushing of percpu slabs on all cpus from using IPI to
workqueue, so that the processing isn't happening with irqs disabled in
the IPI handler. The flushing is not performance critical so it should be
acceptable.
Patch 30 also comes from RT tree and makes object_map_lock RT compatible.
Patches 31-32 make slab_lock irq-safe on RT where we cannot rely on having
irq disabled from the list_lock spin lock usage.
Patch 33 changes kmem_cache_cpu->partial handling in put_cpu_partial()
from cmpxchg loop to a short irq disabled section, which is used by all
other code modifying the field. This addresses a theoretical race
scenario pointed out by Jann, and makes the critical section safe with
respect to the RT local_lock semantics after the conversion in patch 35.
Patch 34 changes preempt disable to migrate disable, so that the nested
list_lock spinlock is safe to take on RT. Because migrate_disable() is a
function call even on !RT, a small set of private wrappers is introduced
to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations.
As of this patch, SLUB should be compatible with RT's lock semantics, to
the best of my knowledge.
Finally, patch 35 changes irq disabled sections that protect
kmem_cache_cpu fields in the slow paths, with a local lock. However on
PREEMPT_RT it means the lockless fast paths can now preempt slow paths
which don't expect that, so the local lock has to be taken also in the
fast paths and they are no longer lockless. It's up to RT folks to decide
if this is a good tradeoff. The patch also updates the locking
documentation in the file's comment.
The main results of this series:
* irq disabling is only done for the minimum amount of time needed to
protect the kmem_cache_cpu data and as part of spin lock, local lock and
bit lock operations to make them irq-safe
* SLUB should be fully PREEMPT_RT compatible
This should have obvious implications for better preemptibility,
especially on RT.
Some details are different than how the previous SLUB RT tree patches were
implemented:
mm: sl[au]b: Change list_lock to raw_spinlock_t [2] - the SLAB part
can be dropped as a different patch restricts RT to SLUB anyway. And
after this series the list_lock in SLUB is never used with irqs disabled
before taking the lock so it doesn't have to be converted to
raw_spinlock_t.
mm: slub: Move discard_slab() invocations out of IRQ-off sections [3]
should be unnecessary as this series does move these invocations outside
irq disabled sections in a different way.
The remaining patches to upstream from the RT tree are small ones related
to KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or
SLOB) makes sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with
PREEMPT_RT could perhaps be re-evaluated as the series addresses some
latency issues with it.
slab_debug_trace_open() can only be called on caches with SLAB_STORE_USER
flag and as with all slub debugging flags, such caches avoid cpu or percpu
partial slabs altogether, so there's nothing to flush.
Link: https://lkml.kernel.org/r/20210805152000.12817-1-vbabka@suse.cz Link: https://lkml.kernel.org/r/20210805152000.12817-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wangyan [Mon, 23 Aug 2021 23:58:53 +0000 (09:58 +1000)]
ocfs2: fix ocfs2 corrupt when iputting an inode
In this condition, it will trigger a BUG on error.
ocfs2_mkdir()
->ocfs2_mknod()
->ocfs2_mknod_locked()
->__ocfs2_mknod_locked()
//Assume inode->i_generation is genN.
->inode->i_generation = osb->s_next_generation++;
// The inode lockres has been initialized.
->ocfs2_populate_inode()
->ocfs2_create_new_inode_locks()
->An error happened, returned value is non-zero
// free the start_bit x in bg_blkno
->ocfs2_free_suballoc_bits()
->... /* Another process successfully executes mkdir at this point,
and it occupies the start_bit x in bg_blkno
which was freed before. Its inode->i_generation
is genN + 1 */
->iput(inode)
->evict()
->ocfs2_evict_inode()
->ocfs2_delete_inode()
->ocfs2_inode_lock()
->ocfs2_inode_lock_update()
/* Bug on here, genN != genN + 1 */
->mlog_bug_on_msg(inode->i_generation !=
le32_to_cpu(fe->i_generation))
So, we do not need to reclaim the inode when the inode->ip_inode_lockres
has been initialized; it will be freed in iput().
Link: http://lkml.kernel.org/r/ef080ca3-5d74-e276-17a1-d9e7c7e662c9@huawei.com Fixes: b1529a41f777 ("ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error") Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wangyan [Mon, 23 Aug 2021 23:58:53 +0000 (09:58 +1000)]
ocfs2: clear links count in ocfs2_mknod() if an error occurs
In this condition, the inode cannot be wiped when an error happens.
ocfs2_mkdir()
->ocfs2_mknod()
->ocfs2_mknod_locked()
->__ocfs2_mknod_locked()
->ocfs2_set_links_count() // i_links_count is 2
-> ... // an error occurs, goto roll_back or leave.
->ocfs2_commit_trans()
->iput(inode)
->evict()
->ocfs2_evict_inode()
->ocfs2_delete_inode()
->ocfs2_inode_lock()
->ocfs2_inode_lock_update()
->ocfs2_refresh_inode()
->set_nlink(); // inode->i_nlink is 2 now.
/* if wipe is 0, it will goto bail_unlock_inode */
->ocfs2_query_inode_wipe()
->if (inode->i_nlink) return; // wipe is 0.
/* inode can not be wiped */
->ocfs2_wipe_inode()
So, we need to clear the links count before the transaction is committed.
Link: http://lkml.kernel.org/r/d8147c41-fb2b-bdf7-b660-1f3c8448c33f@huawei.com Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Gang He [Mon, 23 Aug 2021 23:58:53 +0000 (09:58 +1000)]
ocfs2: reflink deadlock when clone file to the same directory simultaneously
Running reflink from multiple nodes simultaneously to clone a file to the
same directory probably triggers a deadlock issue. For example, there is
a three node ocfs2 cluster, each node mounts the ocfs2 file system to
/mnt/shared, and run the reflink command from each node repeatedly, like
reflink "/mnt/shared/test" \
"/mnt/shared/.snapshots/test.`date +%m%d%H%M%S`.`hostname`"
then the reflink command processes will hang on each node, and you
cannot list this file system's directory.
The problematic reflink command process is blocked at one node,
task:reflink state:D stack: 0 pid: 1283 ppid: 4154
Call Trace:
__schedule+0x2fd/0x750
schedule+0x2f/0xa0
schedule_timeout+0x1cc/0x310
? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
? 0xffffffffc0e3e000
wait_for_completion+0xba/0x140
? wake_up_q+0xa0/0xa0
__ocfs2_cluster_lock.isra.41+0x3b5/0x820 [ocfs2]
? ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_init_security_and_acl+0xbe/0x1d0 [ocfs2]
ocfs2_reflink+0x436/0x4c0 [ocfs2]
? ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_ioctl+0x25e/0x670 [ocfs2]
do_vfs_ioctl+0xa0/0x680
ksys_ioctl+0x70/0x80
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0x1e0
The other reflink command processes are blocked at other nodes,
task:reflink state:D stack: 0 pid:29759 ppid: 4088
Call Trace:
__schedule+0x2fd/0x750
schedule+0x2f/0xa0
schedule_timeout+0x1cc/0x310
? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
? 0xffffffffc0b19000
wait_for_completion+0xba/0x140
? wake_up_q+0xa0/0xa0
__ocfs2_cluster_lock.isra.41+0x3b5/0x820 [ocfs2]
? ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_inode_lock_full_nested+0x1fc/0x960 [ocfs2]
ocfs2_mv_orphaned_inode_to_new+0x87/0x7e0 [ocfs2]
ocfs2_reflink+0x335/0x4c0 [ocfs2]
? ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_reflink_ioctl+0x2ca/0x360 [ocfs2]
ocfs2_ioctl+0x25e/0x670 [ocfs2]
do_vfs_ioctl+0xa0/0x680
ksys_ioctl+0x70/0x80
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x5b/0x1e0
or
task:reflink state:D stack: 0 pid:18465 ppid: 4156
Call Trace:
__schedule+0x302/0x940
? usleep_range+0x80/0x80
schedule+0x46/0xb0
schedule_timeout+0xff/0x140
? ocfs2_control_cfu+0x50/0x50 [ocfs2_stack_user]
? 0xffffffffc0c3b000
__wait_for_common+0xb9/0x170
__ocfs2_cluster_lock.constprop.0+0x1d6/0x860 [ocfs2]
? ocfs2_wait_for_recovery+0x49/0xd0 [ocfs2]
? ocfs2_inode_lock_full_nested+0x30f/0xa50 [ocfs2]
ocfs2_inode_lock_full_nested+0x30f/0xa50 [ocfs2]
ocfs2_inode_lock_tracker+0xf2/0x2b0 [ocfs2]
? dput+0x32/0x2f0
ocfs2_permission+0x45/0xe0 [ocfs2]
inode_permission+0xcc/0x170
link_path_walk.part.0.constprop.0+0x2a2/0x380
? path_init+0x2c1/0x3f0
path_parentat+0x3c/0x90
filename_parentat+0xc1/0x1d0
? filename_lookup+0x138/0x1c0
filename_create+0x43/0x160
ocfs2_reflink_ioctl+0xe6/0x380 [ocfs2]
ocfs2_ioctl+0x1ea/0x2c0 [ocfs2]
? do_sys_openat2+0x81/0x150
__x64_sys_ioctl+0x82/0xb0
do_syscall_64+0x61/0xb0
The deadlock is caused by acquiring the destination directory inode dlm
lock multiple times in the ocfs2_reflink function. We should acquire this
directory inode dlm lock at the beginning, and hold this dlm lock until the
end of the function.
Link: https://lkml.kernel.org/r/20210729110230.18983-1-ghe@suse.com Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tuo Li [Mon, 23 Aug 2021 23:58:53 +0000 (09:58 +1000)]
ocfs2: quota_local: fix possible uninitialized-variable access in ocfs2_local_read_info()
A memory block is allocated through kmalloc() and its return value is
assigned to the pointer oinfo. However, oinfo->dqi_gqinode is not
initialized before it can be accessed in:
iput(oinfo->dqi_gqinode);
To fix this possible uninitialized-variable access, assign NULL to
oinfo->dqi_gqinode and call ocfs2_qinfo_lock_res_init() right after that
assignment in ocfs2_local_read_info(), removing the
ocfs2_qinfo_lock_res_init() call from ocfs2_global_read_info().
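A hedged sketch of the resulting order in ocfs2_local_read_info(); the
allocation flags, labels and the dqi_gqlock field name follow the
existing ocfs2 code but should be read as an outline, not the literal
diff:
	oinfo = kmalloc(sizeof(struct ocfs2_mem_dqinfo), GFP_NOFS);
	if (!oinfo)
		goto out_err;	/* illustrative error label */
	/* Make a later iput(oinfo->dqi_gqinode) safe even on early errors. */
	oinfo->dqi_gqinode = NULL;
	/* Initialize the lock resource here, after the NULL assignment,
	 * instead of in ocfs2_global_read_info(). */
	ocfs2_qinfo_lock_res_init(&oinfo->dqi_gqlock, oinfo);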
Link: https://lkml.kernel.org/r/20210804031832.57154-1-islituo@gmail.com Signed-off-by: Tuo Li <islituo@gmail.com> Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dan Carpenter [Mon, 23 Aug 2021 23:58:52 +0000 (09:58 +1000)]
ocfs2: remove an unnecessary condition
The case where "tmp_oh" is NULL is handled at the start of the function.
At this point we know it's non-NULL so this will always return 1.
Link: https://lkml.kernel.org/r/YOcItgIXtisi3MaO@mwanda Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Larry Chen <lchen@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Kalesh Singh [Mon, 23 Aug 2021 23:58:52 +0000 (09:58 +1000)]
procfs: prevent unprivileged processes accessing fdinfo dir
The file permissions on the fdinfo dir were changed from
S_IRUSR|S_IXUSR to S_IRUGO|S_IXUGO, and a PTRACE_MODE_READ check was added
for opening the fdinfo files [1]. However, the ptrace permission check
was not added to the directory, allowing anyone to get the open FD numbers
by reading the fdinfo directory.
Add the missing ptrace permission check for opening the fdinfo directory.
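A hedged sketch of the idea: gate the fdinfo directory open on the same
PTRACE_MODE_READ check used for the fdinfo files. The helper name below
is illustrative, not the exact function added by the patch.
static int proc_fdinfo_dir_open_sketch(struct inode *inode, struct file *file)
{
	struct task_struct *task = get_proc_task(inode);
	bool allowed;

	if (!task)
		return -ENOENT;
	allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
	put_task_struct(task);
	return allowed ? 0 : -EACCES;
}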
The reason for the panic is that stable_page_flags(), which parses the
page flags, uses uninitialized struct pages reserved by a ZONE_DEVICE
driver.
An earlier approach to fixing this was discussed here:
https://marc.info/?l=linux-mm&m=152964770000672&w=2
This is another approach: to avoid using the uninitialized struct page,
return immediately with KPF_RESERVED at the beginning of
stable_page_flags() if the page is reserved by a ZONE_DEVICE driver.
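A hedged sketch of the early return; pfn_zone_device_reserved() below is
a hypothetical helper standing in for whatever check the patch uses to
detect pfns whose struct page is reserved for ZONE_DEVICE metadata:
	/* At the top of stable_page_flags(), before any flag parsing: */
	if (pfn_zone_device_reserved(page_to_pfn(page)))	/* hypothetical helper */
		return BIT_ULL(KPF_RESERVED);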
Dan said:
: The nvdimm implementation uses vmem_altmap to arrange for the 'struct
: page' array to be allocated from a reservation of a pmem namespace. A
: namespace in this mode contains an info-block that consumes the first
: 8K of the namespace capacity, capacity designated for page mapping,
: capacity for padding the start of data to optionally 4K, 2MB, or 1GB
: (on x86), and then the namespace data itself. The implementation
: specifies a section aligned (now sub-section aligned) address to
: arch_add_memory() to establish the linear mapping to map the metadata,
: and then vmem_altmap indicates to memmap_init_zone() which pfns
: represent data. The implementation only specifies enough 'struct page'
: capacity for pfn_to_page() to operate on the data space, not the
: namespace metadata space.
:
: The proposal to validate ZONE_DEVICE pfns against the altmap seems the
: right approach to me.
Link: http://lkml.kernel.org/r/20190725023100.31141-3-t-fukasawa@vx.jp.nec.com Signed-off-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Junichi Nomura <j-nomura@ce.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Toshiki Fukasawa [Mon, 23 Aug 2021 23:58:52 +0000 (09:58 +1000)]
/proc/kpageflags: prevent an integer overflow in stable_page_flags()
stable_page_flags() returns kpageflags info in u64, but it builds the
value with "1 << KPF_*", which is evaluated as int. This type mismatch
causes no visible problem now, but it will once bit 32 or higher is set,
as a subsequent patch does. So use BIT_ULL in order to avoid future
overflow issues.
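As a quick standalone illustration of the overflow (plain C; BIT_ULL() is
open-coded here since it is a kernel macro):
#include <stdint.h>
#include <stdio.h>

#define BIT_ULL(nr)	(1ULL << (nr))

int main(void)
{
	/*
	 * "1 << kpf" is evaluated as int: bit 31 already turns the value
	 * negative (and sign-extends when widened to u64), and a shift
	 * count of 32 or more is undefined behaviour, so bits 32+ cannot
	 * be set this way.  BIT_ULL() shifts in unsigned 64-bit arithmetic.
	 */
	uint64_t wrong = (uint64_t)(int32_t)(UINT32_C(1) << 31);
	uint64_t right = BIT_ULL(32);

	printf("int-shifted bit 31, widened to u64: %#llx\n",
	       (unsigned long long)wrong);	/* 0xffffffff80000000 */
	printf("BIT_ULL(32):                        %#llx\n",
	       (unsigned long long)right);	/* 0x100000000 */
	return 0;
}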
Link: http://lkml.kernel.org/r/20190725023100.31141-2-t-fukasawa@vx.jp.nec.com Signed-off-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Junichi Nomura <j-nomura@ce.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Matthew Wilcox (Oracle) [Mon, 23 Aug 2021 23:58:51 +0000 (09:58 +1000)]
mm/filemap.c: remove bogus VM_BUG_ON
It is not safe to check page->index without holding the page lock. It can
be changed if the page is moved between the swap cache and the page cache
for a shmem file, for example. There is a VM_BUG_ON below which checks
page->index is correct after taking the page lock.
Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org Fixes: 5c211ba29deb ("mm: add and use find_lock_entries") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: <syzbot+c87be4f669d920c76330@syzkaller.appspotmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
If offline_pages() fails after lru_cache_disable(), it forgets to call
lru_cache_enable() in the error path, so the LRU cache would remain
disabled permanently in this case.
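A hedged sketch of the balanced pattern in offline_pages(); the labels
and the body of the work are simplified placeholders:
	lru_cache_disable();

	ret = do_offline_work();		/* placeholder for the real steps */
	if (ret)
		goto failed_removal_isolated;	/* illustrative label */

	lru_cache_enable();			/* success path */
	return 0;

failed_removal_isolated:
	/* The fix: re-enable the LRU pagevec cache on failure too, so it is
	 * not left disabled permanently. */
	lru_cache_enable();
	return ret;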
Link: https://lkml.kernel.org/r/20210821094246.10149-3-linmiaohe@huawei.com Fixes: d479960e44f2 ("mm: disable LRU pagevec during the migration temporarily") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 24 Aug 2021 16:55:50 +0000 (09:55 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Several small fixes, the first three are significant:
- mlx5 crash unloading drivers with a rare HW config
- missing userspace reporting for the new dmabuf objects
- random rxe failure due to missing memory zeroing
- static checker/etc reports: missing spin lock init, null pointer
deref on error, extra unlock on error path, memory allocation under
spinlock, missing IRQ vector cleanup
- kconfig typo in the new irdma driver"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/rxe: Zero out index member of struct rxe_queue
RDMA/efa: Free IRQ vectors on error flow
RDMA/rxe: Fix memory allocation while in a spin lock
RDMA/bnxt_re: Remove unpaired rtnl unlock in bnxt_re_dev_init()
IB/hfi1: Fix possible null-pointer dereference in _extend_sdma_tx_descs()
RDMA/irdma: Use correct kconfig symbol for AUXILIARY_BUS
RDMA/bnxt_re: Add missing spin lock initialization
RDMA/uverbs: Track dmabuf memory regions
RDMA/mlx5: Fix crash when unbind multiport slave
It turns out that some user-space applications use these uapi header
files, so even though the only user of the interface is an old driver
that was moved to staging, moving the header files causes unnecessary
pain.
Generally, we really don't want user space to use kernel headers
directly (exactly because it causes pain when we re-organize), and
instead copy them as needed. But these things happen, and the headers
were in the uapi directory, so I guess it's not entirely unreasonable.
Linus Torvalds [Sun, 22 Aug 2021 16:49:31 +0000 (09:49 -0700)]
Merge tag 'powerpc-5.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix random crashes on some 32-bit CPUs by adding isync() after
locking/unlocking KUEP
- Fix intermittent crashes when loading modules with strict module RWX
- Fix a section mismatch introduced by a previous fix.
Thanks to Christophe Leroy, Fabiano Rosas, Laurent Vivier, Murilo
Opsfelder Araújo, Nathan Chancellor, and Stan Johnson.
* tag 'powerpc-5.14-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Fix set_memory_*() against concurrent accesses
powerpc/32s: Fix random crashes by adding isync() after locking/unlocking KUEP
powerpc/xive: Do not mark xive_request_ipi() as __init
Linus Torvalds [Sat, 21 Aug 2021 18:22:10 +0000 (11:22 -0700)]
Merge tag 'char-misc-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small driver fixes for 5.14-rc7.
They consist of:
- revert for an interconnect patch that was found to have problems
- ipack tpci200 driver fixes for reported problems
- slimbus messaging and ngd fixes for reported problems
All are small and have been in linux-next for a while with no reported
issues"
* tag 'char-misc-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
ipack: tpci200: fix memory leak in the tpci200_register
ipack: tpci200: fix many double free issues in tpci200_pci_probe
slimbus: ngd: reset dma setup during runtime pm
slimbus: ngd: set correct device for pm
slimbus: messaging: check for valid transaction id
slimbus: messaging: start transaction ids from 1 instead of zero
Revert "interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate"
Linus Torvalds [Sat, 21 Aug 2021 18:10:06 +0000 (11:10 -0700)]
Merge tag 'usb-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fix from Greg KH:
"Here is a single USB typec tcpm fix for a reported problem for
5.14-rc7. It showed up in 5.13 and resolves an issue that Hans found.
It has been in linux-next this week with no reported problems"
* tag 'usb-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: typec: tcpm: Fix VDMs sometimes not being forwarded to alt-mode drivers
Linus Torvalds [Sat, 21 Aug 2021 18:04:26 +0000 (11:04 -0700)]
Merge tag 'riscv-for-linus-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- fix the sifive-l2-cache device tree bindings for json-schema
compatibility. This does not change the intended behavior of the
binding.
- avoid improperly freeing necessary resources during early boot.
* tag 'riscv-for-linus-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix a number of free'd resources in init_resources()
dt-bindings: sifive-l2-cache: Fix 'select' matching
Linus Torvalds [Sat, 21 Aug 2021 17:50:22 +0000 (10:50 -0700)]
Merge tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull mandatory file locking deprecation warning from Jeff Layton:
"As discussed on the list, this patch just adds a new warning for folks
who still have mandatory locking enabled and actually mount with '-o
mand'. I'd like to get this in for v5.14 so we can push this out into
stable kernels and hopefully reach folks who have mounts with -o mand.
For now, I'm operating under the assumption that we'll fully remove
this support in v5.15, but we can move that out if any legitimate
users of this facility speak up between now and then"
* tag 'locks-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fs: warn about impending deprecation of mandatory locks
Linus Torvalds [Sat, 21 Aug 2021 15:11:22 +0000 (08:11 -0700)]
Merge tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Three fixes from Ming Lei that should go into 5.14:
- Fix for a kernel panic when iterating over tags for some cases
where a flush request is present, a regression in this cycle.
- Request timeout fix
- Fix flush request checking"
* tag 'block-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
blk-mq: fix is_flush_rq
blk-mq: fix kernel panic during iterating over flush request
blk-mq: don't grab rq's refcount in blk_mq_check_expired()
Linus Torvalds [Sat, 21 Aug 2021 15:06:26 +0000 (08:06 -0700)]
Merge tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"A few small fixes that should go into this release:
- Fix never re-assigning an initial error value for io_uring_enter()
for SQPOLL, if asked to do nothing
- Fix xa_alloc_cycle() return value checking, for cases where we have
wrapped around
- Fix for a ctx pin issue introduced in this cycle (Pavel)"
* tag 'io_uring-5.14-2021-08-20' of git://git.kernel.dk/linux-block:
io_uring: fix xa_alloc_cycle() error return value check
io_uring: pin ctx on fallback execution
io_uring: only assign io_uring_enter() SQPOLL error in actual error case
Jeff Layton [Fri, 20 Aug 2021 13:29:50 +0000 (09:29 -0400)]
fs: warn about impending deprecation of mandatory locks
We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros
have disabled it. Warn the stragglers that still use "-o mand" that
we'll be dropping support for that mount option.
Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jens Axboe [Fri, 20 Aug 2021 20:53:59 +0000 (14:53 -0600)]
io_uring: fix xa_alloc_cycle() error return value check
We currently check for ret != 0 to indicate error, but '1' is a valid
return and just indicates that the allocation succeeded with a wrap.
Correct the check to be for < 0, like it was before the xarray
conversion.
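A hedged sketch of the corrected check; the xarray, limit and id names
are illustrative rather than copied from io_uring:
	ret = xa_alloc_cyclic(&ctx->personalities, &id, creds,
			      XA_LIMIT(0, USHRT_MAX), &ctx->pers_next,
			      GFP_KERNEL);
	/*
	 * xa_alloc_cyclic() returns 1 when the allocation succeeded but the
	 * cyclic counter wrapped around; only negative values are errors.
	 */
	if (ret < 0)
		return ret;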
Linus Torvalds [Fri, 20 Aug 2021 20:44:25 +0000 (13:44 -0700)]
Merge tag 'acpi-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix two mistakes in new code.
Specifics:
- Prevent confusing messages from being printed if the PRMT table is
not present or there are no PRM modules (Aubrey Li).
- Fix the handling of suspend-to-idle entry and exit in the case when
the Microsoft UUID is used with the Low-Power S0 Idle _DSM
interface (Mario Limonciello)"
* tag 'acpi-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: PM: s2idle: Invert Microsoft UUID entry and exit
ACPI: PRM: Deal with table not present or no module found
Linus Torvalds [Fri, 20 Aug 2021 20:38:42 +0000 (13:38 -0700)]
Merge tag 'pm-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix some issues in the ARM cpufreq drivers and in the operating
performance points (OPP) framework.
Specifics:
- Fix useless WARN() in the OPP core and prevent a noisy warning
from being printed by OPP _put functions (Dmitry Osipenko).
- Fix error path when allocation failed in the arm_scmi cpufreq
driver (Lukasz Luba).
- Blacklist Qualcomm sc8180x and Qualcomm sm8150 in
cpufreq-dt-platdev (Bjorn Andersson, Thara Gopinath).
- Forbid cpufreq for 1.2 GHz variant in the armada-37xx cpufreq
driver (Marek Behún)"
* tag 'pm-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
opp: Drop empty-table checks from _put functions
cpufreq: armada-37xx: forbid cpufreq for 1.2 GHz variant
cpufreq: blocklist Qualcomm sm8150 in cpufreq-dt-platdev
cpufreq: arm_scmi: Fix error path when allocation failed
opp: remove WARN when no valid OPPs remain
cpufreq: blacklist Qualcomm sc8180x in cpufreq-dt-platdev
Linus Torvalds [Fri, 20 Aug 2021 19:59:54 +0000 (12:59 -0700)]
Merge tag 'drm-fixes-2021-08-20-3' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Regularly scheduled fixes. The ttm one solves a problem of GPU drivers
failing to load if debugfs is off in Kconfig, otherwise the i915 and
mediatek, and amdgpu fixes all fairly normal.
Nouveau has a couple of display fixes, but it has a fix for a
longstanding race condition in its memory manager code, and the fix
mostly removes some code that wasn't working properly and has no
userspace users. This fix makes the diffstat kinda larger but in a
good (negative line-count) way."
* tag 'drm-fixes-2021-08-20-3' of git://anongit.freedesktop.org/drm/drm:
drm/amd/display: Use DCN30 watermark calc for DCN301
drm/i915/dp: remove superfluous EXPORT_SYMBOL()
drm/i915/edp: fix eDP MSO pipe sanity checks for ADL-P
drm/i915: Tweaked Wa_14010685332 for all PCHs
drm/nouveau: rip out nvkm_client.super
drm/nouveau: block a bunch of classes from userspace
drm/nouveau/fifo/nv50-: rip out dma channels
drm/nouveau/kms/nv50: workaround EFI GOP window channel format differences
drm/nouveau/disp: power down unused DP links during init
drm/nouveau: recognise GA107
drm: Copy drm_wait_vblank to user before returning
drm/amd/display: Ensure DCN save after VM setup
drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure
drm/amd/pm: change the workload type for some cards
Revert "drm/amd/pm: fix workload mismatch on vega10"
drm: ttm: Don't bail from ttm_global_init if debugfs_create_dir fails
drm/mediatek: Add component_del in OVL and COLOR remove function
drm/mediatek: Add AAL output size configuration
- Add Jim Quinlan et al as Broadcom STB PCIe maintainers (Jim Quinlan)
- Increase D3hot-to-D0 delay for AMD Renoir/Cezanne XHCI (Marcin
Bachry)
- Correct iomem_get_mapping() usage for legacy_mem sysfs (Krzysztof
Wilczyński)
* tag 'pci-v5.14-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI/sysfs: Use correct variable for the legacy_mem sysfs object
PCI: Increase D3 delay for AMD Renoir/Cezanne XHCI
MAINTAINERS: Add Jim Quinlan et al as Broadcom STB PCIe maintainers
MAINTAINERS: Add Rahul Tanwar as Intel LGM Gateway PCIe maintainer
Linus Torvalds [Fri, 20 Aug 2021 19:46:00 +0000 (12:46 -0700)]
Merge tag 'mmc-v5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC host fixes from Ulf Hansson:
- dw_mmc: Fix hang on data CRC error
- mmci: Fix voltage switch procedure for the stm32 variant
- sdhci-iproc: Fix some clock issues for BCM2711
- sdhci-msm: Fixup software timeout value
* tag 'mmc-v5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci-iproc: Set SDHCI_QUIRK_CAP_CLOCK_BASE_BROKEN on BCM2711
mmc: sdhci-iproc: Cap min clock frequency on BCM2711
mmc: sdhci-msm: Update the software timeout value for sdhc
mmc: mmci: stm32: Check when the voltage switch procedure should be done
mmc: dw_mmc: Fix hang on data CRC error
Linus Torvalds [Fri, 20 Aug 2021 19:31:10 +0000 (12:31 -0700)]
Merge tag 'sound-5.14-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull more sound fixes from Takashi Iwai:
"This is a quick follow up for 5.14: a fix for a very recently
introduced regression on ASoC Intel Atom driver, and another trivial
HD-audio quirk for HP laptops"
* tag 'sound-5.14-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: intel: atom: Fix breakage for PCM buffer address setup
ALSA: hda/realtek: Limit mic boost on HP ProBook 445 G8
Linus Torvalds [Fri, 20 Aug 2021 19:18:49 +0000 (12:18 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
- Fix cleaning of vDSO directories
- Ensure CNTHCTL_EL2 is fully initialised when booting at EL2
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: initialize all of CNTHCTL_EL2
arm64: clean vdso & vdso32 files
Linus Torvalds [Fri, 20 Aug 2021 19:11:33 +0000 (12:11 -0700)]
Merge tag 'iommu-fixes-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Fix for a potential NULL-ptr dereference in IOMMU core code
- Two resource leak fixes
- Cache flush fix in the Intel VT-d driver
* tag 'iommu-fixes-v5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Fix incomplete cache flush in intel_pasid_tear_down_entry()
iommu/vt-d: Fix PASID reference leak
iommu: Check if group is NULL before remove device
iommu/dma: Fix leak in non-contiguous API
Xiao Yang [Fri, 20 Aug 2021 11:15:09 +0000 (19:15 +0800)]
RDMA/rxe: Zero out index member of struct rxe_queue
1) A new index member of struct rxe_queue was introduced but not zeroed,
so the initial value of index may be random.
2) The current index is not masked off with index_mask.
In this case producer_addr() and consumer_addr() will compute an invalid
address from the random index, and accessing that invalid address
triggers the following panic:
"BUG: unable to handle page fault for address: ffff9ae2c07a1414"
Fix the issue by using kzalloc() to zero out the index member.
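A hedged sketch of the allocation change (context simplified; the queue
header is what carries the new index member):
	/* Zero the whole queue header so the new index member starts at 0
	 * rather than at whatever the allocator happens to hand back. */
	q = kzalloc(sizeof(*q), GFP_KERNEL);
	if (!q)
		return NULL;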
Mike Kravetz [Fri, 20 Aug 2021 02:04:33 +0000 (19:04 -0700)]
hugetlb: don't pass page cache pages to restore_reserve_on_error
syzbot hit kernel BUG at fs/hugetlbfs/inode.c:532 as described in [1].
This BUG triggers if the HPageRestoreReserve flag is set on a page in
the page cache. It should never be set, as the routine
huge_add_to_page_cache explicitly clears the flag after adding a page to
the cache.
The only code other than huge page allocation which sets the flag is
restore_reserve_on_error. It will potentially set the flag in rare out
of memory conditions. syzbot was injecting errors to cause memory
allocation errors which exercised this specific path.
The code in restore_reserve_on_error is doing the right thing. However,
there are instances where pages in the page cache were being passed to
restore_reserve_on_error. This is incorrect, as once a page goes into
the cache reservation information will not be modified for the page
until it is removed from the cache. Error paths do not remove pages
from the cache, so even in the case of error, the page will remain in
the cache and no reservation adjustment is needed.
Modify routines that potentially call restore_reserve_on_error with a
page cache page to no longer do so.
Note on fixes tag: Prior to commit 846be08578ed ("mm/hugetlb: expand
restore_reserve_on_error functionality") the routine would not process
page cache pages because the HPageRestoreReserve flag is not set on such
pages. Therefore, this issue could not be triggered. The code added
by commit 846be08578ed ("mm/hugetlb: expand restore_reserve_on_error
functionality") is needed and correct. It exposed incorrect calls to
restore_reserve_on_error which is the root cause addressed by this
commit.
Marco Elver [Fri, 20 Aug 2021 02:04:30 +0000 (19:04 -0700)]
kfence: fix is_kfence_address() for addresses below KFENCE_POOL_SIZE
Originally the addr != NULL check was meant to take care of the case
where __kfence_pool == NULL (KFENCE is disabled). However, this does
not work for addresses where addr > 0 && addr < KFENCE_POOL_SIZE.
This can be the case on NULL-deref where addr > 0 && addr < PAGE_SIZE or
any other faulting access with addr < KFENCE_POOL_SIZE. While the
kernel would likely crash, the stack traces and report might be
confusing due to double faults upon KFENCE's attempt to unprotect such
an address.
Fix it by just checking that __kfence_pool != NULL instead.
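A hedged reconstruction of the predicate; the in-tree helper may differ
in detail, but the point is that the trailing check tests __kfence_pool
rather than addr:
static __always_inline bool is_kfence_address(const void *addr)
{
	/*
	 * Before the fix the trailing condition was "&& addr", meant to
	 * catch __kfence_pool == NULL (KFENCE disabled) but defeated by
	 * small non-NULL faulting addresses below KFENCE_POOL_SIZE.
	 */
	return unlikely((unsigned long)((char *)addr - __kfence_pool) <
			KFENCE_POOL_SIZE && __kfence_pool);
}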
Link: https://lkml.kernel.org/r/20210818130300.2482437-1-elver@google.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: <stable@vger.kernel.org> [5.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 20 Aug 2021 02:04:27 +0000 (19:04 -0700)]
mm: vmscan: fix missing psi annotation for node_reclaim()
In a debugging session the other day, Rik noticed that node_reclaim()
was missing memstall annotations. This means we'll miss pressure and
lost productivity resulting from reclaim on an overloaded local NUMA
node when vm.zone_reclaim_mode is enabled.
There haven't been any reports, but that's likely because
vm.zone_reclaim_mode hasn't been a commonly used feature recently, and
the intersection between such setups and psi users is probably nil.
But secondary memory such as CXL-connected DIMMS, persistent memory etc,
and the page demotion patches that handle them
(https://lore.kernel.org/lkml/20210401183216.443C4443@viggo.jf.intel.com/)
could soon make this a more common codepath again.
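The annotation being added is the usual memstall bracket around the
reclaim work; a hedged sketch of its placement in node_reclaim() /
__node_reclaim():
	unsigned long pflags;

	psi_memstall_enter(&pflags);
	/* ... perform the actual node-local reclaim work ... */
	psi_memstall_leave(&pflags);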
Link: https://lkml.kernel.org/r/20210818152457.35846-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Fri, 20 Aug 2021 02:04:24 +0000 (19:04 -0700)]
mm/hwpoison: retry with shake_page() for unhandlable pages
HWPoisonHandlable() sometimes returns false for typical user pages due
to races with average memory events like transfers over LRU lists. This
causes failures in hwpoison handling.
There's retry code for such a case but it does not work because the retry
loop reaches the retry limit too quickly, before the page settles down to
a handlable state. Let get_any_page() call shake_page() to fix it.
[naoya.horiguchi@nec.com: get_any_page(): return -EIO when retry limit reached] Link: https://lkml.kernel.org/r/20210819001958.2365157-1-naoya.horiguchi@linux.dev Link: https://lkml.kernel.org/r/20210817053703.2267588-1-naoya.horiguchi@linux.dev Fixes: 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [5.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Fri, 20 Aug 2021 02:04:21 +0000 (19:04 -0700)]
mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim
We've noticed occasional OOM killing when memory.low settings are in
effect for cgroups. This is unexpected and undesirable as memory.low is
supposed to express non-OOMing memory priorities between cgroups.
The reason for this is proportional memory.low reclaim. When cgroups
are below their memory.low threshold, reclaim passes them over in the
first round, and then retries if it couldn't find pages anywhere else.
But when cgroups are slightly above their memory.low setting, page scan
force is scaled down and diminished in proportion to the overage, to the
point where it can cause reclaim to fail as well - only in that case we
currently don't retry, and instead trigger OOM.
To fix this, hook proportional reclaim into the same retry logic we have
in place for when cgroups are skipped entirely. This way if reclaim
fails and some cgroups were scanned with diminished pressure, we'll try
another full-force cycle before giving up and OOMing.
[akpm@linux-foundation.org: coding-style fixes]
Link: https://lkml.kernel.org/r/20210817180506.220056-1-hannes@cmpxchg.org Fixes: 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Leon Yang <lnyng@fb.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Doug Berger [Fri, 20 Aug 2021 02:04:12 +0000 (19:04 -0700)]
mm/page_alloc: don't corrupt pcppage_migratetype
When placing pages on a pcp list, migratetype values over
MIGRATE_PCPTYPES get added to the MIGRATE_MOVABLE pcp list.
However, the actual migratetype is preserved in the page and should
not be changed to MIGRATE_MOVABLE or the page may end up on the wrong
free_list.
The impact is that HIGHATOMIC or CMA pages getting bulk freed from the
PCP lists could potentially end up on the wrong buddy list. There are
various consequences but minimally NR_FREE_CMA_PAGES accounting could
get screwed up.
Yang Shi [Fri, 20 Aug 2021 02:04:09 +0000 (19:04 -0700)]
Revert "mm: swap: check if swap backing device is congested or not"
Due to the change in how the block layer detects congestion, the
justification of commit 8fd2e0b505d1 ("mm: swap: check if swap backing
device is congested or not") no longer stands, so the commit can simply
be reverted in order to solve the race reported by commit 2efa33fc7f6e
("mm/shmem: fix shmem_swapin() race with swapoff"). That fix was
reverted by the previous patch.
Link: https://lkml.kernel.org/r/20210810202936.2672-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Fri, 20 Aug 2021 02:04:05 +0000 (19:04 -0700)]
Revert "mm/shmem: fix shmem_swapin() race with swapoff"
Due to the change in how the block layer detects congestion, the
justification of commit 8fd2e0b505d1 ("mm: swap: check if swap backing
device is congested or not") no longer stands and that commit can simply
be reverted, which also solves the race reported by commit 2efa33fc7f6e
("mm/shmem: fix shmem_swapin() race with swapoff"); the fix commit can
therefore be reverted as well.
That fix is also somewhat buggy, as discussed in [1] and [2].
Petr Pavlu [Sat, 7 Aug 2021 17:54:50 +0000 (19:54 +0200)]
riscv: Fix a number of free'd resources in init_resources()
Function init_resources() allocates a boot memory block to hold an array of
resources which it adds to iomem_resource. The array is filled in from its
end and the function then attempts to free any unused memory at the
beginning. The problem is that the size of the unused memory is incorrectly
calculated, which can result in releasing memory that is in use by
active resources. Their data then gets corrupted later when the memory is
reused by a different part of the system.
Fix the size of the released memory to correctly match the number of unused
resource entries.
Fixes: ffe0e5261268 ("RISC-V: Improve init_resources()") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Sunil V L <sunilvl@ventanamicro.com> Acked-by: Nick Kossifidis <mick@ics.forth.gr> Tested-by: Sunil V L <sunilvl@ventanamicro.com> Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
When the schema fixups are applied to 'select', the result requires a
single entry for a match, but that will never match because there should
be 2 entries. Also, a 'select' schema should have the widest possible
match, so use 'contains', which matches the compatible string(s) in any
position rather than only the first.
Fixes: 993dcfac64eb ("dt-bindings: riscv: sifive-l2-cache: convert bindings to json-schema") Signed-off-by: Rob Herring <robh@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Bob Pearson [Fri, 13 Aug 2021 21:06:26 +0000 (16:06 -0500)]
RDMA/rxe: Fix memory allocation while in a spin lock
rxe_mcast_add_grp_elem() in rxe_mcast.c calls rxe_alloc() while holding
spinlocks; rxe_alloc() in turn calls kzalloc(size, GFP_KERNEL), which is
incorrect in that context. This patch replaces rxe_alloc() with
rxe_alloc_locked(), which uses GFP_ATOMIC. The bug was introduced by the
commit mentioned below, which failed to handle the need for an atomic
allocation.
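A hedged illustration of why the GFP flag matters here (lock and
structure names are simplified; per the description above,
rxe_alloc_locked() ends up doing an atomic allocation internally):
	spin_lock_irqsave(&grp->mcg_lock, flags);
	/* GFP_KERNEL may sleep and must not be used with the lock held;
	 * GFP_ATOMIC may fail under pressure but never sleeps. */
	elem = kzalloc(sizeof(*elem), GFP_ATOMIC);
	if (!elem) {
		spin_unlock_irqrestore(&grp->mcg_lock, flags);
		return -ENOMEM;
	}
	/* ... link elem into the multicast group ... */
	spin_unlock_irqrestore(&grp->mcg_lock, flags);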
Fixes: 4276fd0dddc9 ("RDMA/rxe: Remove RXE_POOL_ATOMIC") Link: https://lore.kernel.org/r/20210813210625.4484-1-rpearsonhpe@gmail.com Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Linus Torvalds [Thu, 19 Aug 2021 22:32:58 +0000 (15:32 -0700)]
Merge tag 'soc-fixes-5.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Not much to see here. Half the fixes this time are for Qualcomm dts
files, fixing small mistakes on certain machines. The other fixes are:
- A 5.13 regression fix for freescale QE interrupt controller
- A fix for TI OMAP gpt12 timer error handling
- A randconfig build regression fix for ixp4xx
- Another defconfig fix following the CONFIG_FB dependency rework"
* tag 'soc-fixes-5.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
soc: fsl: qe: fix static checker warning
ARM: ixp4xx: fix building both pci drivers
ARM: configs: Update the nhk8815_defconfig
bus: ti-sysc: Fix error handling for sysc_check_active_timer()
soc: fsl: qe: convert QE interrupt controller to platform_device
arm64: dts: qcom: sdm845-oneplus: fix reserved-mem
arm64: dts: qcom: msm8994-angler: Disable cont_splash_mem
arm64: dts: qcom: sc7280: Fixup cpufreq domain info for cpu7
arm64: dts: qcom: msm8992-bullhead: Fix cont_splash_mem mapping
arm64: dts: qcom: msm8992-bullhead: Remove PSCI
arm64: dts: qcom: c630: fix correct powerdown pin for WSA881x
* tag 'net-5.14-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits)
net: dpaa2-switch: disable the control interface on error path
Revert "flow_offload: action should not be NULL when it is referenced"
iavf: Fix ping is lost after untrusted VF had tried to change MAC
i40e: Fix ATR queue selection
r8152: fix the maximum number of PLA bp for RTL8153C
r8152: fix writing USB_BP2_EN
mptcp: full fully established support after ADD_ADDR
mptcp: fix memory leak on address flush
net/rds: dma_map_sg is entitled to merge entries
net: mscc: ocelot: allow forwarding from bridge ports to the tag_8021q CPU port
net: asix: fix uninit value bugs
ovs: clear skb->tstamp in forwarding path
net: mdio-mux: Handle -EPROBE_DEFER correctly
net: mdio-mux: Don't ignore memory allocation errors
net: mdio-mux: Delete unnecessary devm_kfree
net: dsa: sja1105: fix use-after-free after calling of_find_compatible_node, or worse
sch_cake: fix srchost/dsthost hashing mode
ixgbe, xsk: clean up the resources in ixgbe_xsk_pool_enable error path
net: qlcnic: add missed unlock in qlcnic_83xx_flash_read32
mac80211: fix locking in ieee80211_restart_work()
...