Andrew Morton [Wed, 5 Dec 2018 00:14:10 +0000 (11:14 +1100)]
autofs-improve-ioctl-sbi-checks-fix
declare autofs_fs_type in .h, not .c
Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Ian Kent [Wed, 5 Dec 2018 00:14:10 +0000 (11:14 +1100)]
autofs: improve ioctl sbi checks
Al Viro made some suggestions to improve the implementation of commit 0633da48f0 "fix autofs_sbi() does not check super block type".
The check is unnecessary in all cases except for ioctl usage, so placing
the check in the super block accessor function adds a small overhead to
the common case where it isn't needed.
So it's sufficient to do this in the ioctl code only.
Also, the existing check in the ioctl code is needlessly complex.
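As a rough illustration of the direction (not the actual patch; the helper
name here is hypothetical, while autofs_fs_type, autofs_sbi() and
file_inode() are existing kernel symbols), the type check can live at the
ioctl entry point:

  /* Hypothetical sketch: resolve the sbi only if the file really sits on
   * an autofs super block; everywhere else autofs_sbi() stays unchecked. */
  static struct autofs_sb_info *autofs_ioctl_sbi(struct file *f)
  {
          struct super_block *sb = file_inode(f)->i_sb;

          if (sb->s_type != &autofs_fs_type)
                  return NULL;
          return autofs_sbi(sb);
  }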
Davidlohr Bueso [Wed, 5 Dec 2018 00:14:09 +0000 (11:14 +1100)]
fs/epoll: deal with wait_queue only once
There is no reason why we rearm the waitqueue upon every fetch_events
retry (for when events are found yet send_events() fails). If nothing
else, this saves four lock operations per retry, and furthermore reduces
the scope of the lock even further.
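A minimal sketch of the shape of the change (simplified from
fs/eventpoll.c, not the literal patch):

  /* Queue the wait entry once, before the retry loop, instead of
   * re-adding and removing it under ep->wq.lock on every retry. */
  init_waitqueue_entry(&wait, current);
  spin_lock_irq(&ep->wq.lock);
  __add_wait_queue_exclusive(&ep->wq, &wait);
  spin_unlock_irq(&ep->wq.lock);

  /* fetch_events retry loop: wait for and collect events; if
   * send_events() later fails we loop back here without touching
   * ep->wq again */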
Link: http://lkml.kernel.org/r/20181114182532.27981-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Davidlohr Bueso [Wed, 5 Dec 2018 00:14:09 +0000 (11:14 +1100)]
fs/epoll: rename check_events label to send_events
It is currently called check_events because it, well, used to do exactly
that. However, since the lockless ep_events_available() call, the label no
longer checks anything; it just sends the events. Rename it as such.
Link: http://lkml.kernel.org/r/20181114182532.27981-1-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Davidlohr Bueso [Wed, 5 Dec 2018 00:14:08 +0000 (11:14 +1100)]
fs/epoll: avoid barrier after an epoll_wait(2) timeout
Upon timeout, we can just exit out of the loop, without the cost of
changing the task's state with an smp_store_mb() call. Just exit out of the
loop and be done - setting the task state afterwards will be, of course,
redundant.
Link: http://lkml.kernel.org/r/20181108051006.18751-7-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Davidlohr Bueso [Wed, 5 Dec 2018 00:14:08 +0000 (11:14 +1100)]
fs/epoll: reduce the scope of wq lock in epoll_wait()
This patch aims at reducing ep wq.lock hold times in epoll_wait(2). For
the blocking case, there is no need to constantly take and drop the
spinlock, which is only needed to manipulate the waitqueue.
The call to ep_events_available() is now lockless and only exposed to
benign races. If we get a false positive (it reports available events but
does not see another thread deleting an epi from the list), we call into
send_events and the list's state is then seen correctly. On the other
hand, on a false negative, where we don't see a list_add_tail() from, for
example, an irq callback, the condition is rechecked before blocking,
which will see the correct state.
In order to more accurately detect a concurrent list_del_init(), use the
list_empty_careful() variant -- of course, this won't be safe against
insertions from the wakeup path.
For the overflow list we obviously need to prevent load/store tearing as
we don't want to see partial values while the ready list is disabled.
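For reference, a sketch of what the lockless availability check looks like
after this change (this mirrors fs/eventpoll.c; treat it as illustrative):

  static inline int ep_events_available(struct eventpoll *ep)
  {
          return !list_empty_careful(&ep->rdllist) ||
                  READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
  }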
Link: http://lkml.kernel.org/r/20181108051006.18751-6-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Suggested-by: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Davidlohr Bueso [Wed, 5 Dec 2018 00:14:08 +0000 (11:14 +1100)]
fs/epoll: drop ovflist branch prediction
The ep->ovflist is a secondary ready-list to temporarily store events that
might occur when doing sproc without holding the ep->wq.lock. This
accounts for every time we check for ready events and also send events
back to userspace; both callbacks, particularly the latter because of
copy_to_user, can account for a non-trivial time.
As such, the unlikely() check to see if the pointer is being used seems
both misleading and sub-optimal. In fact, we go to an awful lot of
trouble to sync both lists, and populating the ovflist is far from an
uncommon scenario.
For example, profiling a concurrent epoll_wait(2) benchmark with
CONFIG_PROFILE_ANNOTATED_BRANCHES shows that for two threads a 33%
incorrect prediction rate was seen; and when incrementally increasing the
number of epoll instances (which is used, for example, for multiple
queuing load balancing models), up to a 90% incorrect rate was seen.
Similarly, by deleting the prediction, a 3% throughput boost was seen
across incremental thread counts.
Link: http://lkml.kernel.org/r/20181108051006.18751-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
The current logic is a bit convoluted. Let's simplify this with a standard
list_for_each_entry_safe() loop instead, and just break out after maxevents
is reached.
While at it, remove an unnecessary indentation level in the loop for the
case when there are in fact ready events.
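The resulting loop shape, schematically (epi/rdllink are the real
fs/eventpoll.c names, the rest is an illustrative skeleton):

  list_for_each_entry_safe(epi, tmp, &txlist, rdllink) {
          if (res >= maxevents)
                  break;
          /* gather revents and __put_user() the event for epi,
           * then bump res */
  }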
Link: http://lkml.kernel.org/r/20181108051006.18751-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Davidlohr Bueso [Wed, 5 Dec 2018 00:14:07 +0000 (11:14 +1100)]
fs/epoll: remove max_nests argument from ep_call_nested()
Patch series "epoll: some miscellaneous optimizations".
The following are some incremental optimizations on some of the epoll
core. Each patch has the details, but together, the series is seen to
shave off measurable cycles on a number of systems and workloads.
For example, on a 40-core IB, a pipetest as well as parallel epoll_wait()
benchmark show around a 20-30% increase in raw operations per second when
the box is fully occupied (incremental thread counts), and up to 15%
performance improvement with lower counts.
Passes ltp epoll related testcases.
This patch(of 6):
All callers already pass the EP_MAX_NESTS constant, so let's simplify this
a tad and get rid of the redundant parameter for nested eventpolls.
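Roughly, the prototype change (argument list reproduced from memory, treat
it as approximate):

  -static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
  -                          int (*nproc)(void *, void *, int),
  -                          void *priv, void *cookie, void *ctx)
  +static int ep_call_nested(struct nested_calls *ncalls,
  +                          int (*nproc)(void *, void *, int),
  +                          void *priv, void *cookie, void *ctx)

with the nesting depth check inside now using EP_MAX_NESTS directly.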
Link: http://lkml.kernel.org/r/20181108051006.18751-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Joe Perches [Wed, 5 Dec 2018 00:14:07 +0000 (11:14 +1100)]
checkpatch: warn on const char foo[] = "bar"; declarations
These declarations should generally be static const to avoid poor
compilation and runtime performance, where compilers tend to re-initialize
the const declaration on every call instead of using .rodata for the string.
Miscellanea:
o Convert spaces to tabs for indentation in 2 adjacent checks
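An illustrative example of what the new warning is about (for a declaration
inside a function):

  /* flagged: the array may be rebuilt on the stack at every call */
  const char foo[] = "bar";

  /* preferred: the string lives in .rodata and is built once */
  static const char foo[] = "bar";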
Yury Norov [Wed, 5 Dec 2018 00:14:07 +0000 (11:14 +1100)]
lib/find_bit_benchmark.c: align test_find_next_and_bit with others
Contrary to the other tests, the test_find_next_and_bit() test uses tab
formatting in its output and get_cycles() instead of ktime_get().
get_cycles() is not supported on some arches, so ktime_get() fits better in
generic code.
Fix it and some minor style issues, so that the output looks like this:
Alexey Skidanov [Wed, 5 Dec 2018 00:14:07 +0000 (11:14 +1100)]
lib/genalloc.c: fix allocation of aligned buffer from non-aligned chunk
gen_pool_alloc_algo() uses different allocation functions implementing
different allocation algorithms. With gen_pool_first_fit_align()
allocation function, the returned address should be aligned on the
requested boundary.
If the chunk start address isn't aligned on the requested boundary, the
returned address isn't aligned either. The only way to get a properly
aligned address is to initialize the pool with chunks aligned on the
requested boundary. If we want to be able to allocate buffers aligned on
different boundaries (for example, 4K, 1MB, ...), the chunk start address
should be aligned on the maximum possible alignment.
This happens because gen_pool_first_fit_align() looks for a properly
aligned memory block without taking the chunk start address alignment into
account.
To fix this, we provide chunk start address to gen_pool_first_fit_align()
and change its implementation such that it starts looking for properly
aligned block with appropriate offset (exactly as is done in CMA).
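A simplified sketch of the offset handling (the real patch threads the
chunk start address through the allocation callbacks; variable names here
are illustrative):

  /* Express the alignment in allocation-order units, and compute the bit
   * offset within the chunk bitmap at which addresses aligned to 'align'
   * actually start, given that the chunk itself may be misaligned. */
  align_mask = ((align + (1UL << order) - 1) >> order) - 1;
  align_off  = (chunk_start_addr & (align - 1)) >> order;
  start_bit  = bitmap_find_next_zero_area_off(map, size, start, nr,
                                              align_mask, align_off);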
Matthew Wilcox [Wed, 5 Dec 2018 00:14:06 +0000 (11:14 +1100)]
fls: change parameter to unsigned int
When testing in userspace, UBSAN pointed out that shifting into the sign
bit is undefined behaviour. It doesn't really make sense to ask for the
highest set bit of a negative value, so just turn the argument type into
an unsigned int.
Some architectures (eg ppc) already had it declared as an unsigned int, so
I don't expect too many problems.
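The change itself is essentially just the prototype (per-arch definitions
vary; this is the generic shape):

  -int fls(int x);
  +int fls(unsigned int x);

so the implementation no longer has to shift a signed value into the sign
bit.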
Link: http://lkml.kernel.org/r/20181105221117.31828-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Masahiro Yamada [Wed, 5 Dec 2018 00:14:06 +0000 (11:14 +1100)]
build_bug.h: remove most of dummy BUILD_BUG_ON stubs for Sparse
The introduction of these dummy BUILD_BUG_ON stubs dates back to 903c0c7cdc21 ("sparse: define dummy BUILD_BUG_ON definition for sparse").
At that time, BUILD_BUG_ON() was implemented with the negative array trick
*and* the link-time trick, like this:
  extern int __build_bug_on_failed;
  #define BUILD_BUG_ON(condition)                                \
          do {                                                   \
                  ((void)sizeof(char[1 - 2*!!(condition)]));     \
                  if (condition) __build_bug_on_failed = 1;      \
          } while(0)
Sparse is more strict about the negative array trick than GCC because
Sparse requires the array length to be really constant.
Here is the simple test code for the macro above:
static const int x = 0;
BUILD_BUG_ON(x);
GCC is absolutely fine with it (-Wvla was enabled only very recently),
but Sparse warns like this:
error: bad constant expression
error: cannot size expression
(If you are using a newer version of Sparse, you will see a different
warning message, "warning: Variable length array is used".)
Anyway, Sparse was producing many false positives, and was noisier than
it should have been at that time.
With the previous commit, the leftover negative array trick is gone.
Sparse is fine with the current BUILD_BUG_ON(), which is implemented by
using the 'error' attribute.
I am keeping the stub for BUILD_BUG_ON_ZERO(). Otherwise, Sparse would
complain about the following code, which GCC is fine with:
static const int x = 0;
int y = BUILD_BUG_ON_ZERO(x);
Link: http://lkml.kernel.org/r/1542856462-18836-3-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Masahiro Yamada [Wed, 5 Dec 2018 00:14:05 +0000 (11:14 +1100)]
build_bug.h: remove negative-array fallback for BUILD_BUG_ON()
The kernel can only be compiled with an optimization option (-O2, -Os, or
the currently proposed -Og). Hence, __OPTIMIZE__ is always defined in the
kernel source.
The fallback for the -O0 case is just hypothetical and pointless.
Moreover, commit 0bb95f80a38f ("Makefile: Globally enable VLA warning")
enabled -Wvla warning. The use of variable length arrays is banned.
Link: http://lkml.kernel.org/r/1542856462-18836-2-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Eric Biggers [Wed, 5 Dec 2018 00:14:05 +0000 (11:14 +1100)]
fs/proc/util.c: include fs/proc/internal.h for name_to_int()
name_to_int() is defined in fs/proc/util.c and declared in
fs/proc/internal.h, but the declaration isn't included at the point of the
definition. Include the header to enforce that the definition matches the
declaration.
This addresses a gcc warning when -Wmissing-prototypes is enabled.
Link: http://lkml.kernel.org/r/20181115001833.49371-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Benjamin Gordon [Wed, 5 Dec 2018 00:14:04 +0000 (11:14 +1100)]
fs/proc/base.c: use ns_capable instead of capable for timerslack_ns
Access to timerslack_ns is controlled by a process having CAP_SYS_NICE in
its effective capability set, but the current check looks in the root
namespace instead of the process' user namespace. Since a process is
allowed to do other activities controlled by CAP_SYS_NICE inside a
namespace, it should also be able to adjust timerslack_ns.
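The gist of the check change, sketched (the exact surrounding code in
fs/proc/base.c is not reproduced here):

  /* before */
  if (!capable(CAP_SYS_NICE))
          return -EPERM;

  /* after: check CAP_SYS_NICE in the target task's user namespace */
  rcu_read_lock();
  if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
          rcu_read_unlock();
          return -EPERM;
  }
  rcu_read_unlock();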
Link: http://lkml.kernel.org/r/20181030180012.232896-1-bmgordon@google.com Signed-off-by: Benjamin Gordon <bmgordon@google.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: John Stultz <john.stultz@linaro.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Oren Laadan <orenl@cellrox.com> Cc: Ruchi Kandoi <kandoiruchi@google.com> Cc: Rom Lemarchand <romlem@android.com> Cc: Todd Kjos <tkjos@google.com> Cc: Colin Cross <ccross@android.com> Cc: Nick Kralevich <nnk@google.com> Cc: Dmitry Shmidt <dimitrysh@google.com> Cc: Elliott Hughes <enh@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tetsuo Handa [Wed, 5 Dec 2018 00:14:04 +0000 (11:14 +1100)]
fs/buffer.c: add debug print for __getblk_gfp() stall problem
Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
__getblk_gfp() line in the backtrace. Since there is a comment block that
says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
attempt from grow_dev_page() is failing, let's start from checking whether
syzbot is hitting that case. This change will be removed after the bug is
fixed.
Link: http://lkml.kernel.org/r/9b9fcdda-c347-53ee-fdbb-8a7d11cf430e@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@redhat.com> Cc: <syzkaller-bugs@googlegroups.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
zhong jiang [Wed, 5 Dec 2018 00:14:04 +0000 (11:14 +1100)]
mm/page_owner: align with pageblock_nr pages
When pfn_valid(pfn) returns false, pfn should be aligned to
pageblock_nr_pages rather than MAX_ORDER_NR_PAGES in init_pages_in_zone(),
because the skipped 2M may contain valid pfns; as a result, the early
allocated count will not be accurate.
Currently, init_pages_in_zone() walks the zone in pageblock_nr_pages
steps. MAX_ORDER_NR_PAGES may contain holes when CONFIG_HOLES_IN_ZONE is
set, and MAX_ORDER_NR_PAGES is likely to differ from pageblock_nr_pages.
If we skip by MAX_ORDER_NR_PAGES, the second 2M of such a block is leaked
from the accounting.
Meanwhile, the change makes the code consistent, because the entire
function is based on pageblock_nr_pages steps.
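Schematically, the skip in init_pages_in_zone() becomes (illustrative):

  if (!pfn_valid(pfn)) {
          pfn = ALIGN(pfn + 1, pageblock_nr_pages);
          continue;
  }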
Link: http://lkml.kernel.org/r/1512395284-13588-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yu Zhao [Wed, 5 Dec 2018 00:14:04 +0000 (11:14 +1100)]
mm: don't expose page to fast gup before it's ready
We don't want to expose a page before it is properly set up. During page
setup, we may call page_add_new_anon_rmap() which uses a non-atomic bit
op. If the page is exposed before setup is done, we could overwrite page
flags that are set by get_user_pages_fast() or its callers. Here is a
non-fatal scenario (there might be other fatal problems that I didn't look
into):

        CPU 1                           CPU 2
        set_pte_at()                    get_user_pages_fast()
        page_add_new_anon_rmap()        gup_pte_range()
        __SetPageSwapBacked()           SetPageReferenced()
Fix the problem by delaying set_pte_at() until page is ready.
I didn't observe the race directly. But I did get a few crashes when
trying to access the mem_cgroup of pages returned by
get_user_pages_fast(). Those pages were charged and they showed a valid
mem_cgroup in kdumps. So this led me to think the problem came from the
premature set_pte_at().
I think the fact that nobody complained about this problem is because
the race only happens when using ksm+swap, and it might not cause any
fatal problem even so. Nevertheless, it's nice to have set_pte_at()
done consistently after rmap is added and page is charged.
Link: http://lkml.kernel.org/r/20180108225632.16332-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Huang Ying [Wed, 5 Dec 2018 00:14:03 +0000 (11:14 +1100)]
mm: fix race between swapoff and mincore
Since commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks"),
after swapoff, the address_space associated with the swap device is freed.
So swap_address_space() users which touch the address_space need some kind
of mechanism to prevent the address_space from being freed while it is
being accessed.
When mincore processes an unmapped range for swapped shmem pages, it
doesn't hold a lock to prevent the swap device from being swapped off, so
the following race is possible:
The address space may be accessed after being freed.
To fix the race, get_swap_device()/put_swap_device() is used to enclose
find_get_page() to check whether the swap entry is valid and prevent the
swap device from being swapoff during accessing.
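A sketch of the fixed lookup in the mincore path, using the
get_swap_device()/put_swap_device() helpers introduced by the companion
swapoff patch below (illustrative, not the exact hunk):

  si = get_swap_device(entry);
  if (si) {
          page = find_get_page(swap_address_space(entry),
                               swp_offset(entry));
          put_swap_device(si);
  } else {
          page = NULL;
  }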
Link: http://lkml.kernel.org/r/20180313012036.1597-1-ying.huang@intel.com Fixes: 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Huang Ying [Wed, 5 Dec 2018 00:14:03 +0000 (11:14 +1100)]
mm, swap: fix race between swapoff and some swap operations
When swapin is performed, after getting the swap entry information from
the page table, the system will swap in the swap entry without any lock
held to prevent the swap device from being swapped off. This may cause a
race like the one below.
Because swapoff is usually only done at system shutdown, the race may not
hit many people in practice. But it is still a race that needs to be
fixed.
To fix the race, get_swap_device() is added to check whether the specified
swap entry is valid in its swap device. If so, it will keep the swap
entry valid via preventing the swap device from being swapoff, until
put_swap_device() is called.
Because swapoff() is a very rare code path, to make the normal path run as
fast as possible, disabling preemption + stop_machine() is used to
implement get/put_swap_device() instead of a reference count. From
get_swap_device() to put_swap_device(), preemption is disabled, so
stop_machine() in swapoff() will wait until put_swap_device() is called.
In addition to swap_map, cluster_info, etc. data structure in the struct
swap_info_struct, the swap cache radix tree will be freed after swapoff,
so this patch fixes the race between swap cache looking up and swapoff
too.
Races between some other swap cache usages protected via disabling
preemption and swapoff are fixed too via calling stop_machine() between
clearing PageSwapCache() and freeing swap cache data structure.
An alternative implementation could replace the preemption disabling with
rcu_read_lock_sched() and stop_machine() with synchronize_sched().
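A very rough sketch of the reader side described above (not the actual
implementation; the SWP_VALID flag name is an assumption here):

  struct swap_info_struct *get_swap_device(swp_entry_t entry)
  {
          struct swap_info_struct *si;

          preempt_disable();
          si = swp_swap_info(entry);
          if (!si || !(si->flags & SWP_VALID)) {
                  preempt_enable();
                  return NULL;
          }
          /* preemption stays disabled until put_swap_device() */
          return si;
  }

  void put_swap_device(struct swap_info_struct *si)
  {
          preempt_enable();
  }

swapoff() then runs stop_machine() after clearing the valid state, which
cannot complete until every CPU has left such a preemption-disabled
section.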
Arun KS [Wed, 5 Dec 2018 00:14:03 +0000 (11:14 +1100)]
mm/page_alloc.c: remove software prefetching in __free_pages_core()
The prefetch hints not only increase the code footprint, they actually
make things slower rather than faster. Remove them, as contemporary
hardware doesn't need any hint.
Link: http://lkml.kernel.org/r/1538727006-5727-2-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Mathieu Malaterre <malat@debian.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 5 Dec 2018 00:14:03 +0000 (11:14 +1100)]
memory_hotplug-free-pages-as-higher-order-fix-fix
fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch
Cc: Arun KS <arunks@codeaurora.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 5 Dec 2018 00:14:02 +0000 (11:14 +1100)]
memory_hotplug-free-pages-as-higher-order-fix
avoid return of void-returning __free_pages_core(), per Oscar
Cc: Arun KS <arunks@codeaurora.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Arun KS [Wed, 5 Dec 2018 00:14:02 +0000 (11:14 +1100)]
mm/page_alloc.c: memory hotplug: free pages as higher order
When freeing pages is done with a higher order, the time spent on
coalescing pages by the buddy allocator can be reduced. With a section
size of 256MB, the hot add latency of a single section improves from
50-60 ms to less than 1 ms, hence improving the hot add latency by about
60 times. Modify external providers of the online callback to align with
the change.
Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anthony Yznaga [Wed, 5 Dec 2018 00:14:02 +0000 (11:14 +1100)]
/proc/kpagecount: return 0 for special pages that are never mapped
Certain pages that are never mapped to userspace have a type indicated in
the page_type field of their struct pages (e.g. PG_buddy). page_type
overlaps with _mapcount so set the count to 0 and avoid calling
page_mapcount() for these pages.
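The check in fs/proc/page.c:kpagecount_read() then becomes, roughly:

  if (!ppage || PageSlab(ppage) || page_has_type(ppage))
          pcount = 0;
  else
          pcount = page_mapcount(ppage);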
Link: http://lkml.kernel.org/r/1543963526-27917-1-git-send-email-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anthony Yznaga [Wed, 5 Dec 2018 00:14:02 +0000 (11:14 +1100)]
tools/vm/page-types.c: fix "kpagecount returned fewer pages than expected" failures
Because kpagecount_read() fakes success if map counts are not being
collected, clamp the page count passed to it by walk_pfn() to the pages
value returned by the preceding call to kpageflags_read().
Link: http://lkml.kernel.org/r/1543962269-26116-1-git-send-email-anthony.yznaga@oracle.com Fixes: 7f1d23e60718 ("tools/vm/page-types.c: include shared map counts") Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
s/__free_pages_core/__free_pages_boot_core/, for patch ordering
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alexander Duyck [Wed, 5 Dec 2018 00:14:01 +0000 (11:14 +1100)]
mm: use common iterator for deferred_init_pages and deferred_free_pages
Create a common iterator to be used by both deferred_init_pages() and
deferred_free_pages(). By doing this we can cut down a bit on code
overhead as they will likely both be inlined into the same function
anyway.
This new approach allows deferred_init_pages to make use of
__init_pageblock(). By doing this we can cut down on the code size by
sharing code between both the hotplug and deferred memory init code paths.
An additional benefit to this approach is that we improve in cache
locality of the memory init as we can focus on the memory areas related to
identifying if a given PFN is valid and keep that warm in the cache until
we transition to a region of a different type. So we will stream through
a chunk of valid blocks before we turn to initializing page structs.
On my x86_64 test system with 384GB of memory per node I saw a reduction
in initialization time from 1.38s to 1.06s as a result of this patch.
Link: http://lkml.kernel.org/r/154361480390.7497.9730184349746888133.stgit@ahduyck-desk1.amr.corp.intel.com Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alexander Duyck [Wed, 5 Dec 2018 00:14:01 +0000 (11:14 +1100)]
mm: add reserved flag setting to set_page_links()
Modify set_page_links() to include the setting of the reserved flag via a
simple AND and OR operation. The motivation for this is the fact that the
existing __set_bit call still seems to have an effect on performance, as
replacing the call with the AND and OR can reduce initialization time.
Looking over the assembly code before and after the change, the main
difference between the two is that the reserved bit is stored in a value
that is generated outside of the main initialization loop and is then
written with the other flags field values in one write to the page->flags
value. Previously the generated value was written and then a btsq
instruction was issued.
On my x86_64 test system with 3TB of persistent memory per node I saw the
persistent memory initialization time on average drop from 23.49s to
19.12s per node.
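A hedged sketch of the kind of combined write described above (the shift
and mask macros are from include/linux/mm.h; the helper name is
hypothetical, the actual patch modifies set_page_links() itself):

  /* Fold zone, node and the optional reserved bit into one
   * read-modify-write of page->flags instead of a separate __set_bit(). */
  static inline void set_page_links_reserved(struct page *page,
                                             enum zone_type zone,
                                             int node, bool reserved)
  {
          unsigned long flags = ((unsigned long)zone << ZONES_PGSHIFT) |
                                ((unsigned long)node << NODES_PGSHIFT);

          if (reserved)
                  flags |= 1UL << PG_reserved;

          page->flags &= ~((ZONES_MASK << ZONES_PGSHIFT) |
                           (NODES_MASK << NODES_PGSHIFT) |
                           (1UL << PG_reserved));
          page->flags |= flags;
  }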
Link: http://lkml.kernel.org/r/154361479877.7497.2824031260670152276.stgit@ahduyck-desk1.amr.corp.intel.com Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alexander Duyck [Wed, 5 Dec 2018 00:14:01 +0000 (11:14 +1100)]
mm: move hot-plug specific memory init into separate functions and optimize
Combine the bits in memmap_init_zone and memmap_init_zone_device that are
related to hotplug into a single function called __memmap_init_hotplug.
Also take the opportunity to integrate __init_single_page's functionality
into this function. In doing so we can get rid of some of the redundancy
such as the LRU pointers versus the pgmap.
Link: http://lkml.kernel.org/r/154361479366.7497.13916678539146224699.stgit@ahduyck-desk1.amr.corp.intel.com Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alexander Duyck [Wed, 5 Dec 2018 00:14:01 +0000 (11:14 +1100)]
mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections
Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
use it to support initializing and freeing pages in groups no larger than
MAX_ORDER_NR_PAGES. By doing this we can greatly improve the cache
locality of the pages while we do several loops over them in the init and
freeing process.
We are able to tighten the loops further as a result of the "from"
iterator as we can perform the initial checks for first_init_pfn in our
first call to the iterator, and continue without the need for those checks
via the "from" iterator. I have added this functionality in the function
called deferred_init_mem_pfn_range_in_zone that primes the iterator and
causes us to exit if we encounter any failure.
On my x86_64 test system with 384GB of memory per node I saw a reduction
in initialization time from 1.85s to 1.38s as a result of this patch.
Link: http://lkml.kernel.org/r/154361478854.7497.15456929701404283744.stgit@ahduyck-desk1.amr.corp.intel.com Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alexander Duyck [Wed, 5 Dec 2018 00:14:00 +0000 (11:14 +1100)]
mm: implement new zone specific memblock iterator
Introduce a new iterator for_each_free_mem_pfn_range_in_zone.
This iterator will take care of making sure a given memory range provided
is in fact contained within a zone. It takes care of all the bounds
checking we were doing in deferred_grow_zone and deferred_init_memmap.
In addition it should help to speed up the search a bit by iterating until
the end of a range is greater than the start of the zone pfn range, and
will exit completely if the start is beyond the end of the zone.
Link: http://lkml.kernel.org/r/154361478343.7497.6591693538181082582.stgit@ahduyck-desk1.amr.corp.intel.com Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alexander Duyck [Wed, 5 Dec 2018 00:14:00 +0000 (11:14 +1100)]
mm: drop meminit_pfn_in_nid as it is redundant
As best as I can tell the meminit_pfn_in_nid call is completely redundant.
The deferred memory initialization is already making use of
for_each_free_mem_range which in turn will call into __next_mem_range
which will only return a memory range if it matches the node ID provided
assuming it is not NUMA_NO_NODE.
I am operating on the assumption that there are no zones or pg_data_t
structures that have a NUMA node of NUMA_NO_NODE associated with them. If
that is the case then __next_mem_range will never return a memory range
that doesn't match the zone's node ID and as such the check is redundant.
So one piece I would like to verify on this is if this works for ia64.
Technically it was using a different approach to get the node ID, but it
seems to have the node ID also encoded into the memblock. So I am
assuming this is okay, but would like to get confirmation on that.
On my x86_64 test system with 384GB of memory per node I saw a reduction
in initialization time from 2.80s to 1.85s as a result of this patch.
Link: http://lkml.kernel.org/r/154361477830.7497.18073959471440151885.stgit@ahduyck-desk1.amr.corp.intel.com Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Alexander Duyck [Wed, 5 Dec 2018 00:14:00 +0000 (11:14 +1100)]
mm: use mm_zero_struct_page from SPARC on all 64b architectures
Patch series "Deferred page init improvements", v6.
This patchset is essentially a refactor of the page initialization logic
that is meant to provide for better code reuse while providing a
significant improvement in deferred page initialization performance.
In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
memory per node I have seen the following. In the case of regular memory
initialization the deferred init time was decreased from 3.75s to 1.06s on
average. For the persistent memory the initialization time dropped from
24.17s to 19.12s on average. This amounts to a 253% improvement for the
deferred memory initialization performance, and a 26% improvement in the
persistent memory initialization performance.
I have called out the improvement observed with each patch.
This patch (of 7):
Use the same approach that was already in use on Sparc on all the
architectures that support a 64b long.
This is mostly motivated by the fact that 7 to 10 store/move instructions
are likely always going to be faster than having to call into a function
that is not specialized for handling page init.
An added advantage to doing it this way is that the compiler can get away
with combining writes in the __init_single_page call. As a result the
memset call will be reduced to only about 4 write operations, or at least
that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
count/mapcount seem to be cancelling out at least 4 of the 8 assignments
on my system.
One change I had to make to the function was to reduce the minimum page
size to 56 to support some powerpc64 configurations.
This change should introduce no functional change on SPARC since it
already had this code. In the case of x86_64 I saw a reduction from 3.75s
to 2.80s when initializing 384GB of RAM per node. Pavel Tatashin tested on
a system with Broadcom's Stingray CPU and 48GB of RAM and found that
__init_single_page() takes 19.30ns per 64-byte struct page before this
patch and 17.33ns per 64-byte struct page with it. Mike Rapoport ran a
similar test on an OpenPower (S812LC 8348-21C) system with a Power8
processor and 128GB of RAM. His results per 64-byte struct page were
4.68ns before, and 4.59ns after this patch.
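For reference, the generic 64-bit helper is roughly of this shape (a sketch
from memory; the exact sizes and switch arms should be treated as
approximate):

  static inline void __mm_zero_struct_page(struct page *page)
  {
          unsigned long *_pp = (void *)page;

          /* struct page must be a multiple of 8 bytes, 56..80 in size */
          BUILD_BUG_ON(sizeof(struct page) & 7);
          BUILD_BUG_ON(sizeof(struct page) < 56);
          BUILD_BUG_ON(sizeof(struct page) > 80);

          switch (sizeof(struct page)) {
          case 80:
                  _pp[9] = 0;     /* fallthrough */
          case 72:
                  _pp[8] = 0;     /* fallthrough */
          case 64:
                  _pp[7] = 0;     /* fallthrough */
          case 56:
                  _pp[6] = 0;
                  _pp[5] = 0;
                  _pp[4] = 0;
                  _pp[3] = 0;
                  _pp[2] = 0;
                  _pp[1] = 0;
                  _pp[0] = 0;
          }
  }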
Link: http://lkml.kernel.org/r/154361477318.7497.13432441396440493352.stgit@ahduyck-desk1.amr.corp.intel.com Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David S. Miller <davem@davemloft.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 5 Dec 2018 00:14:00 +0000 (11:14 +1100)]
mm/page_alloc.c: drop uneeded __meminit and __meminitdata
Since commit 03e85f9d5f1 ("mm/page_alloc: Introduce
free_area_init_core_hotplug"), some functions changed to only be called
during system initialization. Concretely, free_area_init_node() and the
functions that hang from it.
Also, some variables are no longer used after the system has gone through
initialization. So this could be considered as a late clean-up for that
patch.
This patch changes the functions from __meminit to __init, and the
variables from __meminitdata to __initdata.
In return, we get some KBs back:
Before:
Freeing unused kernel image memory: 2472K
After:
Freeing unused kernel image memory: 2480K
Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Brian Foster <bfoster@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Brian Foster [Wed, 5 Dec 2018 00:13:57 +0000 (11:13 +1100)]
mm/page-writeback.c: don't break integrity writeback on ->writepage() error
write_cache_pages() is used in both background and integrity writeback
scenarios by various filesystems. Background writeback is mostly
concerned with cleaning a certain number of dirty pages based on various
mm heuristics. It may not write the full set of dirty pages or wait for
I/O to complete. Integrity writeback is responsible for persisting a set
of dirty pages before the writeback job completes. For example, an
fsync() call must perform integrity writeback to ensure data is on disk
before the call returns.
write_cache_pages() unconditionally breaks out of its processing loop in
the event of a ->writepage() error. This is fine for background
writeback, which had no strict requirements and will eventually come
around again. This can cause problems for integrity writeback on
filesystems that might need to clean up state associated with failed page
writeouts. For example, XFS performs internal delayed allocation
accounting before returning a ->writepage() error, where applicable. If
the current writeback happens to be associated with an unmount and
write_cache_pages() completes the writeback prematurely due to error, the
filesystem is unmounted in an inconsistent state if dirty+delalloc pages
still exist.
To handle this problem, update write_cache_pages() to always process the
full set of pages for integrity writeback regardless of ->writepage()
errors. Save the first encountered error and return it to the caller once
complete. This facilitates XFS (or any other fs that expects integrity
writeback to process the entire set of dirty pages) to clean up its
internal state completely in the event of persistent mapping errors.
Background writeback continues to exit on the first error encountered.
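The core of the change in write_cache_pages(), roughly (illustrative, not
the literal diff):

  error = (*writepage)(page, wbc, data);
  if (unlikely(error)) {
          if (error == AOP_WRITEPAGE_ACTIVATE) {
                  unlock_page(page);
                  error = 0;
          } else if (wbc->sync_mode != WB_SYNC_ALL) {
                  /* background writeback: bail out as before */
                  ret = error;
                  done_index = page->index + 1;
                  done = 1;
                  break;
          }
          /* integrity writeback: remember the first error, keep going */
          if (!ret)
                  ret = error;
  }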
Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:57 +0000 (11:13 +1100)]
lib/show_mem.c: drop pgdat_resize_lock in show_mem()
The show_mem() function is used to print the system memory status when the
user requests it or when a memory allocation fails. Generally, this is
best effort information, so any races with memory hotplug (or, very
theoretically, an early initialization) should be tolerable and the worst
that could happen is to print an imprecise node state.
Drop the resize lock because this is the only place which might hold the
lock from interrupt context, and so all other callers might use a simple
spinlock. Even though this doesn't solve any real issue, it makes the code
easier to follow and a tiny bit more effective.
Link: http://lkml.kernel.org/r/20181129235532.9328-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Wed, 5 Dec 2018 00:13:57 +0000 (11:13 +1100)]
drivers/base/memory.c: use DEVICE_ATTR_RO and friends
Let's use the easier to read (and not mess up) variants:
- Use DEVICE_ATTR_RO
- Use DEVICE_ATTR_WO
- Use DEVICE_ATTR_RW
instead of the more generic DEVICE_ATTR() we're using right now.
We have to rename most callback functions. By fixing the indentation we
can even save some LOCs.
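An illustrative before/after of the pattern (generic example, not a
specific drivers/base/memory.c hunk):

  /* before */
  static ssize_t show_phys_device(struct device *dev,
                                  struct device_attribute *attr, char *buf);
  static DEVICE_ATTR(phys_device, 0444, show_phys_device, NULL);

  /* after: DEVICE_ATTR_RO() expects a callback named <attr>_show */
  static ssize_t phys_device_show(struct device *dev,
                                  struct device_attribute *attr, char *buf);
  static DEVICE_ATTR_RO(phys_device);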
Link: http://lkml.kernel.org/r/20181203111611.10633-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
This patch was initially posted by Kelley. Reposting the patch with all
review comments addressed and with minor modifications and optimizations.
Tests were rerun and commit message updated with new results.
try_to_unuse() is of quadratic complexity, with a lot of wasted effort.
It unuses swap entries one by one, potentially iterating over all the page
tables for all the processes in the system for each one.
This new proposed implementation of try_to_unuse simplifies its complexity
to linear. It iterates over the system's mms once, unusing all the
affected entries as it walks each set of page tables. It also makes
similar changes to shmem_unuse.
Improvement
swapoff was called on a swap partition containing about 6G of data, in a
VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted.
Present implementation....about 1200M calls (8min, avg 80% cpu util).
Prototype.................about 9.0K calls (3min, avg 5% cpu util).
Details
In shmem_unuse(), iterate over the shmem_swaplist and, for each
shmem_inode_info that contains a swap entry, pass it to
shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(),
iterate over its associated xarray, and store the index and value of each
swap entry in an array for passing to shmem_swapin_page() outside of the
RCU critical section.
In try_to_unuse(), instead of iterating over the entries in the type and
unusing them one by one, perhaps walking all the page tables for all the
processes for each one, iterate over the mmlist, making one pass. Pass
each mm to unuse_mm() to begin its page table walk, and during the walk,
unuse all the ptes that have backing store in the swap type received by
try_to_unuse(). After the walk, check the type for orphaned swap entries
with find_next_to_unuse(), and remove them from the swap cache. If
find_next_to_unuse() starts over at the beginning of the type, repeat the
check of the shmem_swaplist and the walk a maximum of three times.
Change unuse_mm() and the intervening walk functions down to
unuse_pte_range() to take the type as a parameter, and to iterate over
their entire range, calling the next function down on every iteration. In
unuse_pte_range(), make a swap entry from each pte in the range using the
passed in type. If it has backing store in the type, call
swapin_readahead() to retrieve the page and pass it to unuse_pte().
Pass the count of pages_to_unuse down the page table walks in
try_to_unuse(), and return from the walk when the desired number of pages
has been swapped back in.
Link: http://lkml.kernel.org/r/20181203170934.16512-2-vpillai@digitalocean.com Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jerome Glisse [Wed, 5 Dec 2018 00:13:56 +0000 (11:13 +1100)]
mm/mmu_notifier: contextual information for event triggering invalidation
CPU page table updates can happen for many reasons, not only as a result
of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
a result of kernel activities (memory compaction, reclaim, migration,
...).
Users of the mmu notifier API track changes to the CPU page table and take
specific action for them. The current API only provides the range of
virtual addresses affected by the change, not why the change is happening.
This patchset adds event information so that users of mmu notifier can
differentiate among broad category:
- UNMAP: munmap() or mremap()
- CLEAR: page table is cleared (migration, compaction, reclaim, ...)
- PROTECTION_VMA: change in access protections for the range
- PROTECTION_PAGE: change in access protections for page in the range
- SOFT_DIRTY: soft dirtiness tracking
Being able to distinguish munmap() and mremap() from the other reasons why
the page table is cleared is important to allow users of mmu notifiers to
update their own internal tracking structures accordingly (on munmap or
mremap it is no longer necessary to track the range of virtual addresses
as it becomes invalid).
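A sketch of the kind of enum this introduces (names taken from the list
above; the exact definition is the patch's):

  enum mmu_notifier_event {
          MMU_NOTIFY_UNMAP = 0,
          MMU_NOTIFY_CLEAR,
          MMU_NOTIFY_PROTECTION_VMA,
          MMU_NOTIFY_PROTECTION_PAGE,
          MMU_NOTIFY_SOFT_DIRTY,
  };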
Link: http://lkml.kernel.org/r/20181203201817.10759-4-jglisse@redhat.com Signed-off-by: Jerome Glisse <jglisse@redhat.com> Acked-by: Christian König <christian.koenig@amd.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
ERROR: code indent should use tabs where possible
#102: FILE: include/linux/mm.h:1506:
+^I^I ^I struct mmu_notifier_range *range,$
WARNING: please, no space before tabs
#102: FILE: include/linux/mm.h:1506:
+^I^I ^I struct mmu_notifier_range *range,$
WARNING: function definition argument 'struct mmu_notifier_range *' should also have an identifier name
#117: FILE: include/linux/mmu_notifier.h:223:
+extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *);
ERROR: code indent should use tabs where possible
#922: FILE: mm/memory.c:4122:
+^I^I ^I struct mmu_notifier_range *range,$
WARNING: please, no space before tabs
#922: FILE: mm/memory.c:4122:
+^I^I ^I struct mmu_notifier_range *range,$
WARNING: line over 80 characters
#1033: FILE: mm/mmu_notifier.c:183:
+ !range->blockable ? "non-" : "");
WARNING: line over 80 characters
#1170: FILE: mm/oom_kill.c:539:
+ if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
WARNING: line over 80 characters
#1178: FILE: mm/oom_kill.c:544:
+ unmap_page_range(&tlb, vma, range.start, range.end, NULL);
total: 2 errors, 6 warnings, 1133 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
NOTE: Whitespace errors detected.
You may wish to use scripts/cleanpatch or scripts/cleanfile
./patches/mm-mmu_notifier-use-structure-for-invalidate_range_start-end-calls.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ross Zwisler <zwisler@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jerome Glisse [Wed, 5 Dec 2018 00:13:55 +0000 (11:13 +1100)]
mm/mmu_notifier: use structure for invalidate_range_start/end calls
To avoid having to change many call sites every time we want to add a
parameter, use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end calls. No functional changes with this patch.
Link: http://lkml.kernel.org/r/20181203201817.10759-3-jglisse@redhat.com Signed-off-by: Jerome Glisse <jglisse@redhat.com> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jan Kara <jack@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ross Zwisler <zwisler@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jerome Glisse [Wed, 5 Dec 2018 00:13:55 +0000 (11:13 +1100)]
mm/mmu_notifier: use structure for invalidate_range_start/end callback
Patch series "mmu notifier contextual informations".
This patchset adds contextual information - why an invalidation is
happening - to the mmu notifier callbacks. This is necessary for users of
mmu notifiers that wish to maintain their own data structures without
having to add new fields to struct vm_area_struct (vma).
For instance, a device can have its own page table that mirrors the
process address space. When a vma is unmapped (munmap() syscall) the
device driver can free the device page table for the range.
Today we do not have any information on why an mmu notifier callback is
happening, and thus the device driver has to assume that it is always a
munmap(). This is inefficient as it means that it needs to re-allocate
the device page table on the next page fault and rebuild the whole device
driver data structure for the range.
Other use cases besides munmap() also exist; for instance, it is pointless
for a device driver to invalidate the device page table when the
invalidation is for soft dirtiness tracking. Or the device driver can
optimize away an mprotect() that changes the page table access permissions
for the range.
This patchset enables all these optimizations for device drivers. I do
not include any of them in this series but another patchset I am posting
will leverage this.
The patchset is pretty simple from a code point of view. The first two
patches consolidate all mmu notifier arguments into a struct so that it is
easier to add/change arguments. The last patch adds the contextual
information (munmap, protection, soft dirty, clear, ...).
This patch (of 3):
To avoid having to change many callback definitions every time we want to
add a parameter, use a structure to group all parameters for the
mmu_notifier invalidate_range_start/end callbacks. No functional changes
with this patch.
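A rough illustration of the callback-side change (the exact structure layout is defined by the patch; these signatures are a sketch):
	/* before: each parameter is passed separately to every callback */
	int (*invalidate_range_start)(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start, unsigned long end,
				      bool blockable);

	/* after: one range structure groups the parameters, so adding a new
	 * field later does not require touching every driver's callback */
	int (*invalidate_range_start)(struct mmu_notifier *mn,
				      const struct mmu_notifier_range *range);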
Link: http://lkml.kernel.org/r/20181203201817.10759-2-jglisse@redhat.com Signed-off-by: Jerome Glisse <jglisse@redhat.com> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Jan Kara <jack@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ross Zwisler <zwisler@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:55 +0000 (11:13 +1100)]
mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection
During the online_pages phase, pgdat->nr_zones will be updated in case this
zone is empty.
Currently the online_pages phase is protected by the global locks
(device_hotplug_lock and mem_hotplug_lock), which ensures there is
no contention during the update of nr_zones.
These global locks introduce scalability issues (especially the second
one), which slow down code relying on get_online_mems(). This is also a
preparation for not having to rely on get_online_mems() but instead on some
more fine-grained locks.
The patch moves init_currently_empty_zone() under both zone_span_writelock
and pgdat_resize_lock because both the pgdat state (nr_zones) and the
zone's start_pfn are changed. This patch also changes the documentation of
node_size_lock to include the protection of nr_zones.
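A simplified sketch of the resulting locking order in move_pfn_range_to_zone() (helper names as in mainline of that era; not the literal diff):
	pgdat_resize_lock(pgdat, &flags);
	zone_span_writelock(zone);
	if (zone_is_empty(zone))
		init_currently_empty_zone(zone, start_pfn, nr_pages);	/* now under both locks */
	resize_zone_range(zone, start_pfn, nr_pages);
	zone_span_writeunlock(zone);
	resize_pgdat_range(pgdat, start_pfn, nr_pages);
	pgdat_resize_unlock(pgdat, &flags);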
Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:55 +0000 (11:13 +1100)]
mm, sparse: pass nid instead of pgdat to sparse_add_one_section()
Since the only information needed in sparse_add_one_section() is the node
id, used to allocate proper memory, it is not necessary to pass the whole
pgdat.
This patch changes the prototype of sparse_add_one_section() to pass the
node id directly. This avoids the misleading impression that
sparse_add_one_section() would touch the pgdat.
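The prototype change is roughly the following (the altmap argument reflects the interface of that era and is assumed here):
	/* before: the whole pgdat is passed in */
	int sparse_add_one_section(struct pglist_data *pgdat, unsigned long start_pfn,
				   struct vmem_altmap *altmap);

	/* after: only the node id, which is all the function needs for allocation */
	int sparse_add_one_section(int nid, unsigned long start_pfn,
				   struct vmem_altmap *altmap);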
Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
v4:
* fix typo in changelog
* adjust second paragraph of changelog
v3:
* adjust the changelog with the reason for this change
* remove a comment for pgdat_resize_lock
* split the prototype change of sparse_add_one_section() out into a
separate patch
Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:54 +0000 (11:13 +1100)]
mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section()
pgdat_resize_lock is used to protect pgdat's memory region information,
like node_start_pfn, node_present_pages, etc., while in
sparse_add/remove_one_section() it is used to protect the
initialization/release of one mem_section. This does not look proper.
These code paths are already protected by mem_hotplug_lock, and should
there ever be any reason for locking at the sparse layer, a dedicated lock
should be introduced.
Following is the current call trace of sparse_add/remove_one_section()
The comment above the pgdat_resize_lock also mentions "Holding this will
also guarantee that any pfn_valid() stays that way.", but the current
implementation does not actually provide that guarantee: there aren't any
pfn walkers that take the lock, so this looks like a relic from the past.
This patch also removes the comment.
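Conceptually the change amounts to dropping the lock/unlock pair around the per-section setup, roughly (helper names as in mainline of that era; a sketch, not the full diff):
	/* was: pgdat_resize_lock(pgdat, &flags); */
	section_mark_present(ms);
	sparse_init_one_section(ms, section_nr, memmap, usemap);
	/* was: pgdat_resize_unlock(pgdat, &flags); */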
[mhocko@suse.com: changelog suggestion] Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yu Zhao [Wed, 5 Dec 2018 00:13:54 +0000 (11:13 +1100)]
mm: remove pte_lock_deinit()
A page table page doesn't touch page->mapping or have any used field that
overlaps with it. There is no need to clear mapping in the dtor. In fact,
doing so might mask problems that would otherwise be detected by bad_page().
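For reference, the removed helper looked roughly like this (from include/linux/mm.h of that era; shown as a sketch):
	static inline void pte_lock_deinit(struct page *page)
	{
		page->mapping = NULL;	/* unnecessary: page table pages never set mapping */
		ptlock_free(page);
	}

	/* after the patch, pgtable_page_dtor() calls ptlock_free() directly */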
Link: http://lkml.kernel.org/r/20181128235525.58780-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Minchan Kim [Wed, 5 Dec 2018 00:13:54 +0000 (11:13 +1100)]
zram: writeback throttle
If there is a lot of write I/O to a flash device, the storage can suffer
from wearout. To overcome the problem, the admin needs to impose a write
limit to guarantee flash health for the entire product life.
This patch creates a new knob "writeback_limit" on zram.
writeback_limit's default value is 0, so it doesn't limit any writeback.
If admins want to measure the writeback count over a certain period, they
can read it from the 3rd column of /sys/block/zram0/bd_stat.
If admins want to limit writeback to, for example, 400M per day, they can
do it as shown below.
If admins want to allow further writes again, they can do it like below:
echo 0 > /sys/block/zram0/writeback_limit
If admins want to see the remaining writeback budget:
cat /sys/block/zram0/writeback_limit
The writeback_limit count will reset whenever you reset zram (e.g., system
reboot, echo 1 > /sys/block/zramX/reset), so keeping track of how much
writeback happened before the reset, in order to allocate extra writeback
budget in the next setting, is the user's job.
Minchan Kim [Wed, 5 Dec 2018 00:13:53 +0000 (11:13 +1100)]
zram: writeback throttle
If there is a lot of write I/O to a flash device, the storage can suffer
from wearout. To overcome the problem, the admin needs to impose a write
limit to guarantee flash health for the entire product life.
This patch creates a new knob "writeback_limit" for zram.
writeback_limit's default value is 0, so it doesn't limit any writeback.
If admins want to measure the writeback count over a certain period, they
can read it from the 3rd column of /sys/block/zram0/bd_stat.
If admins want to limit writeback to, for example, 400M per day, they can
do it as shown below.
If admins want to allow further writes again, they can do it like below:
echo 0 > /sys/block/zram0/writeback_limit
If admins want to see the remaining writeback budget:
cat /sys/block/zram0/writeback_limit
The writeback_limit count will reset whenever you reset zram (e.g., system
reboot, echo 1 > /sys/block/zramX/reset), so keeping track of how much
writeback happened before the reset, in order to allocate extra writeback
budget in the next setting, is the user's job.
Link: http://lkml.kernel.org/r/20181127055429.251614-8-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Minchan Kim [Wed, 5 Dec 2018 00:13:53 +0000 (11:13 +1100)]
zram-add-bd_stat-statistics-v4
bd_stat represents things that happened in the backing device. Currently it
supports bd_counts, bd_reads and bd_writes, which are helpful for
understanding flash wearout and memory savings.
Link: http://lkml.kernel.org/r/20181203024045.153534-7-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Minchan Kim [Wed, 5 Dec 2018 00:13:53 +0000 (11:13 +1100)]
zram: add bd_stat statistics
bd_stat represents things that happened in the backing device. Currently
it supports bd_counts, bd_reads and bd_writes, which are helpful for
understanding flash wearout and memory savings.
Link: http://lkml.kernel.org/r/20181127055429.251614-7-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Minchan Kim [Wed, 5 Dec 2018 00:13:53 +0000 (11:13 +1100)]
zram: support idle/huge page writeback
Add a new feature "zram idle/huge page writeback". In the zram-swap use
case, zram usually has many idle/huge swap pages. It's pointless to keep
them in memory (i.e., zram).
To solve this problem, this feature introduces idle/huge page writeback to
the backing device; the goal is to save more memory space on embedded
systems.
The normal sequence to use the idle/huge page writeback feature is as follows:
while (1) {
# mark allocated zram slot to idle
echo all > /sys/block/zram0/idle
# leave system working for several hours
# If some blocks on zram have not been accessed in the meantime,
# they are still IDLE-marked pages.
echo "idle" > /sys/block/zram0/writeback
and/or
echo "huge" > /sys/block/zram0/writeback
# write the IDLE and/or huge marked slots to the backing device
# and free the memory.
}
Per the discussion at
https://lore.kernel.org/lkml/20181122065926.GG3441@jagdpanzerIV/T/#u,
this patch removes the direct incompressible page writeback feature
(d2afd25114f4 ("zram: write incompressible pages to backing device")).
Below are the concerns from Sergey:
== &< ==
"IDLE writeback" is superior to "incompressible writeback".
"incompressible writeback" is completely unpredictable and uncontrollable;
it depends on data patterns and compression algorithms. While "IDLE
writeback" is predictable.
I even suspect, that, *ideally*, we can remove "incompressible writeback".
"IDLE pages" is a super set which also includes "incompressible" pages.
So, technically, we still can do "incompressible writeback" from "IDLE
writeback" path; but a much more reasonable one, based on a page idling
period.
I understand that you want to keep "direct incompressible writeback"
around. ZRAM is especially popular on devices which do suffer from flash
wearout, so I can see "incompressible writeback" path becoming a dead
code, long term.
== &< ==
Below are the concerns from Minchan:
== &< ==
My concern is that if we enable CONFIG_ZRAM_WRITEBACK in this implementation,
both hugepage/idlepage writeback will turn on. However, some users want to
enable only idlepage writeback, so we would need to introduce a turn on/off
knob for hugepage, or a new CONFIG_ZRAM_IDLEPAGE_WRITEBACK for those use
cases. I don't want to make it complicated *if possible*.
Long term, I imagine we need to make the VM aware of a new swap hierarchy, a
little bit different from the current one. For example, a first,
high-priority swap device could return -EIO or -ENOCOMP, and swap would then
try to fall back to the next lower-priority swap device. With that, hugepage
writeback would work transparently.
So we could regard it as a regression because incompressible pages don't go
to backing storage automatically. Instead, the user should do it manually
via "echo huge > /sys/block/zram0/writeback".
== &< ==
Link: http://lkml.kernel.org/r/20181127055429.251614-6-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Minchan Kim [Wed, 5 Dec 2018 00:13:53 +0000 (11:13 +1100)]
zram: introduce ZRAM_IDLE flag
To support idle page writeback with upcoming patches, this patch
introduces a new ZRAM_IDLE flag.
Userspace can mark zram slots as "idle" via
"echo all > /sys/block/zramX/idle"
which marks every allocated zram slot as ZRAM_IDLE.
Users can see it via /sys/kernel/debug/zram/zram0/block_state.
Minchan Kim [Wed, 5 Dec 2018 00:13:52 +0000 (11:13 +1100)]
zram: fix lockdep warning of free block handling
Patch series "zram idle page writeback", v3.
Inherently, a swap device has many idle pages which are rarely touched after
they are allocated. That is never a problem if we use a storage device as
swap. However, it's just a waste for zram-swap.
This patchset adds a zram idle page writeback feature:
* Admins can define what an idle page is ("no access since X time ago")
* Admins can define when zram should write idle pages back
* Admins can define when zram should stop writeback to prevent wearout
With the writeback feature, zram_slot_free_notify can be called
in softirq context by end_swap_bio_read. However, bitmap_lock
is not aware of that, so lockdep yells out.
With akpm's suggestion (i.e., bitmap operations are already atomic), we
can remove the bitmap lock. It might fail to find an empty slot if serious
contention happens. However, that's not a severe problem because huge page
writeback already has the possibility of failing if there is severe memory
pressure. The worst case is just keeping the incompressible page in memory
instead of on storage.
The other problem is zram_slot_lock in zram_slot_free_notify. To make it
safe, this patch introduces zram_slot_trylock() and uses it in
zram_slot_free_notify(). Although the lock is rarely contended, this patch
adds a new debug stat, "miss_free", to keep monitoring how often that
happens.
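A sketch of the approach (names approximate the zram driver of that era, not a verbatim excerpt):
	static int zram_slot_trylock(struct zram *zram, u32 index)
	{
		return bit_spin_trylock(ZRAM_LOCK, &zram->table[index].value);
	}

	static void zram_slot_free_notify(struct block_device *bdev,
					  unsigned long index)
	{
		struct zram *zram = bdev->bd_disk->private_data;

		/* may run in softirq context via end_swap_bio_read: never sleep here */
		if (!zram_slot_trylock(zram, index)) {
			atomic64_inc(&zram->stats.miss_free);	/* new debug stat */
			return;
		}
		zram_free_page(zram, index);
		zram_slot_unlock(zram, index);
	}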
Link: http://lkml.kernel.org/r/20181127055429.251614-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Joey Pabalinas <joeypabalinas@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Qian Cai [Wed, 5 Dec 2018 00:13:51 +0000 (11:13 +1100)]
mm/memblock.c: skip kmemleak for kasan_init()
Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
Huawei TaiShan 2280 aarch64 servers).
After calling start_kernel()->setup_arch()->kasan_init(), the kmemleak early
log buffer went from something like 280 to 260000 entries, which caused
kmemleak to be disabled and the crash dump memory reservation to fail. The
multitude of kmemleak_alloc() calls comes from nested loops while KASAN is
setting up full memory mappings, so let early kmemleak allocations skip
those memblock_alloc_internal() calls that come from kasan_init(), given
that those early KASAN memory mappings should not reference other memory.
Hence, no kmemleak false positives.
Oscar Salvador [Wed, 5 Dec 2018 00:13:51 +0000 (11:13 +1100)]
mm, memory_hotplug: refactor shrink_zone/pgdat_span
shrink_zone_span and shrink_pgdat_span look a bit weird.
They both have a loop at the end to check whether the zone or pgdat contains
only holes, in case the section to be removed was neither the first one
nor the last one.
Both loops look quite similar, so we can simplify them a bit. We do
that by creating a function (has_only_holes) that basically calls
find_smallest_section_pfn() with the full range. If nothing is found, we
do not have any sections left there.
To be honest, I am not really sure we even need to go through this check
in case we are removing a middle section, because from what I can see, we
will always have a first/last section.
Taking the chance, we could also simplify both find_smallest_section_pfn()
and find_biggest_section_pfn() functions and move the common code to a
helper.
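A minimal sketch of the helper described above (assuming the mainline find_smallest_section_pfn() signature; illustrative only):
	static bool has_only_holes(int nid, struct zone *zone,
				   unsigned long start_pfn, unsigned long end_pfn)
	{
		/*
		 * If no section is found in the whole range, the zone/pgdat
		 * now spans nothing but holes.
		 */
		return !find_smallest_section_pfn(nid, zone, start_pfn, end_pfn);
	}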
Link: http://lkml.kernel.org/r/20181127162005.15833-6-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 5 Dec 2018 00:13:50 +0000 (11:13 +1100)]
mm, memory-hotplug: rework unregister_mem_sect_under_nodes
This tries to address another issue with accessing uninitialized pages.
Jonathan reported a problem [1] where we can access stale pages in case we
hot-remove memory without onlining it first.
This time it is in unregister_mem_sect_under_nodes. This function tries to
get the nid from the pfn and then tries to remove the symlink between
mem_blk <-> nid and vice versa.
Since we already know the nid in remove_memory(), we can pass it down the
chain to unregister_mem_sect_under_nodes. There we can just remove the
symlinks without the need to look into the pages.
This also allows us to clean up unregister_mem_sect_under_nodes.
Link: http://lkml.kernel.org/r/20181127162005.15833-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 5 Dec 2018 00:13:50 +0000 (11:13 +1100)]
mm, memory_hotplug: move zone/pages handling to offline stage
The current implementation accesses pages during hot-remove stage in order
to get the zone linked to this memory-range. We use that zone for a)
check if the zone is ZONE_DEVICE and b) to shrink the zone's spanned
pages.
Accessing pages during this stage is problematic, as we might be accessing
pages that were not initialized if we did not get to online the memory
before removing it.
The only reason to check for ZONE_DEVICE in __remove_pages is to bypass
the call to release_mem_region_adjustable(), since these regions are
removed with devm_release_mem_region.
With patch#2, this is no longer a problem so we can safely call
release_mem_region_adjustable(). release_mem_region_adjustable() will
spot that the region we are trying to remove was acquired by means of
devm_request_mem_region, and will back off safely.
This allows us to remove all zone-related operations from hot-remove
stage.
Because of this, the zone's spanned pages are shrunk during the offlining
stage in shrink_zone_pgdat(). It would have been great to also decrease
the node's spanned pages there, but we need them in
try_offline_node(). So we still decrease the node's spanned pages in
the hot-remove stage.
The only particularity is that
find_smallest_section_pfn/find_biggest_section_pfn, when called from
shrink_zone_span, will now check for online sections rather than valid
sections. To make this work with the devm/HMM code, we need to call
offline_mem_sections and online_mem_sections in that code path when we are
adding memory.
Link: http://lkml.kernel.org/r/20181127162005.15833-4-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 5 Dec 2018 00:13:49 +0000 (11:13 +1100)]
kernel, resource: check for IORESOURCE_SYSRAM in release_mem_region_adjustable
This is a preparation for the next patch.
Currently, we only call release_mem_region_adjustable() in __remove_pages
if the zone is not ZONE_DEVICE, because resources that belong to HMM/devm
are being released by themselves with devm_release_mem_region.
Since we do not want to touch any zone/page stuff during the removal of
the memory (but rather during the offlining), we do not want to check for
the zone here. So we need another way to tell release_mem_region_adjustable()
not to release the resource in case it belongs to HMM/devm.
HMM/devm acquires/releases a resource through
devm_request_mem_region/devm_release_mem_region.
These resources have the flag IORESOURCE_MEM, while resources acquired by
the hot-add memory path (register_memory_resource()) contain
IORESOURCE_SYSTEM_RAM.
So, we can check for this flag in release_mem_region_adjustable(), and if
the resource does not contain such a flag, we know that we are dealing with
an HMM/devm resource and can back off.
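The check then looks roughly like this inside the loop that walks the resource tree in release_mem_region_adjustable() (a simplified sketch; the actual patch may structure it differently):
	if (!(res->flags & IORESOURCE_SYSRAM)) {
		/*
		 * Not hot-added System RAM: this is an HMM/devm region
		 * that will be released via devm_release_mem_region(),
		 * so back off and leave it alone.
		 */
		break;
	}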
Link: http://lkml.kernel.org/r/20181127162005.15833-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 5 Dec 2018 00:13:49 +0000 (11:13 +1100)]
mm, memory_hotplug: add nid parameter to arch_remove_memory
Patch series "Do not touch pages in hot-remove path", v2.
This patchset aims for two things:
1) A better definition of the offline and hot-remove stages
2) Solving bugs where we can access non-initialized pages
during hot-remove operations [2] [3].
This is achieved by moving all page/zone handling to the offline
stage, so we do not need to access pages when hot-removing memory.
This is a preparation for the follow-up patches. The idea of passing
the nid is that it will allow us to get rid of the zone parameter
afterwards.
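The prototype change is roughly as follows (the altmap argument reflects the interface of that era and is assumed here):
	/* before */
	int arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap);

	/* after: the caller (remove_memory()) already knows the node id */
	int arch_remove_memory(int nid, u64 start, u64 size,
			       struct vmem_altmap *altmap);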
Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
But this is not necessary, since defer_init() is called like this:
memmap_init_zone()
for pfn in [start_pfn, end_pfn)
defer_init(pfn, end_pfn)
In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would
stop before calling defer_init().
BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct,
since nr_initialised is zone based instead of node based. Even if
node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone
could have fewer pages than PAGES_PER_SECTION.
Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Hugh Dickins [Wed, 5 Dec 2018 00:13:47 +0000 (11:13 +1100)]
mm: put_and_wait_on_page_locked() while page is migrated
Waiting on a page migration entry has used wait_on_page_locked() all along
since 2006: but you cannot safely wait_on_page_locked() without holding a
reference to the page, and that extra reference is enough to make
migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
on the entry before migrate_page_move_mapping() gets there.
And that failure is retried nine times, amplifying the pain when trying to
migrate a popular page. With a single persistent faulter, migration
sometimes succeeds; with two or three concurrent faulters, success becomes
much less likely (and the more the page was mapped, the worse the overhead
of unmapping and remapping it on each try).
This is especially a problem for memory offlining, where the outer level
retries forever (or until terminated from userspace), because a heavy
refault workload can trigger an endless loop of migration failures.
wait_on_page_locked() is the wrong tool for the job.
David Herrmann (but was he the first?) noticed this issue in 2014:
https://marc.info/?l=linux-mm&m=140110465608116&w=2
Tim Chen started a thread in August 2017 which appears relevant:
https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
on to implicate __migration_entry_wait():
https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
wake_up_page_bit")
Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
https://marc.info/?l=linux-mm&m=154217936431300&w=2
We have all assumed that it is essential to hold a page reference while
waiting on a page lock: partly to guarantee that there is still a struct
page when MEMORY_HOTREMOVE is configured, but also to protect against
reuse of the struct page going to someone who then holds the page locked
indefinitely, when the waiter can reasonably expect timely unlocking.
But in fact, so long as wait_on_page_bit_common() does the put_page(), and
is careful not to rely on struct page contents thereafter, there is no
need to hold a reference to the page while waiting on it. That does mean
that this case cannot go back through the loop: but that's fine for the
page migration case, and even if used more widely, is limited by the "Stop
walking if it's locked" optimization in wake_page_function().
Add interface put_and_wait_on_page_locked() to do this, using "behavior"
enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
No interruptible or killable variant needed yet, but they might follow: I
have a vague notion that reporting -EINTR should take precedence over
return from wait_on_page_bit_common() without knowing the page state, so
arrange it accordingly - but that may be nothing but pedantic.
__migration_entry_wait() still has to take a brief reference to the page,
prior to calling put_and_wait_on_page_locked(): but now that it is dropped
before waiting, the chance of impeding page migration is very much
reduced. Should we perhaps disable preemption across this?
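A simplified sketch of the resulting call pattern in __migration_entry_wait() (helper names as in mainline of that era; illustrative only):
	page = migration_entry_to_page(entry);
	get_page(page);			/* brief reference, only to pin struct page */
	pte_unmap_unlock(ptep, ptl);
	/*
	 * The reference is dropped inside before sleeping, so the waiter no
	 * longer inflates the refcount that migrate_page_move_mapping()
	 * checks, and migration can proceed while we wait.
	 */
	put_and_wait_on_page_locked(page);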
shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
survived a lot of testing before that showed up. PageWaiters may have
been set by wait_on_page_bit_common(), and the reference dropped, just
before shrink_page_list() succeeds in freezing its last page reference: in
such a case, unlock_page() must be used. Follow the suggestion from
Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
that optimization predates PageWaiters, and won't buy much these days; but
we can reinstate it for the !PageWaiters case if anyone notices.
It does raise the question: should vmscan.c's is_page_cache_freeable() and
__remove_mapping() now treat a PageWaiters page as if an extra reference
were held? Perhaps, but I don't think it matters much, since
shrink_page_list() already had to win its trylock_page(), so waiters are
not very common there: I noticed no difference when trying the bigger
change, and it's surely not needed while put_and_wait_on_page_locked() is
only used for page migration.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Baoquan He <bhe@redhat.com> Tested-by: Baoquan He <bhe@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Baoquan He <bhe@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Nick Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
yuzhoujian [Wed, 5 Dec 2018 00:13:47 +0000 (11:13 +1100)]
mm, oom: add oom victim's memcg to the oom context information
The current oom report doesn't display victim's memcg context during the
global OOM situation. While this information is not strictly needed, it
can be really helpful for containerized environments to locate which
container has lost a process. Now that we have a single line for the oom
context, we can trivially add both the oom memcg (this can be either
global_oom or a specific memcg which hits its hard limits) and task_memcg
which is the victim's memcg.
Below is the single line output in the oom report after this patch.
Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Roman Gushchin <guro@fb.com> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
yuzhoujian [Wed, 5 Dec 2018 00:13:46 +0000 (11:13 +1100)]
mm, oom: reorganize the oom report in dump_header
The OOM report contains several sections. The first one is the allocation
context that has triggered the OOM. Then we have the cpuset context,
followed by the stack trace of the OOM path. The third one is the OOM memory
information, followed by the current memory state of all system tasks.
At last, we show the oom-eligible tasks and the information about the
chosen oom victim.
One thing that makes parsing more awkward than necessary is that we do not
have a single and easily parsable line about the oom context. This patch
reorganizes the oom report to show
1) who invoked oom and what was the allocation request
An admin can easily get the full oom context on a single line, which
makes parsing much easier.
Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:44 +0000 (11:13 +1100)]
mm: stall movable allocations until kswapd progresses during serious external fragmentation event
An event that potentially causes external fragmentation problems has
already been described but there are degrees of severity. A "serious"
event is defined as one that steals a contiguous range of pages of an
order lower than fragment_stall_order (PAGE_ALLOC_COSTLY_ORDER by
default). If a movable allocation request that is allowed to sleep needs
to steal a small block then it schedules until kswapd makes progress or a
timeout passes. The watermarks are also boosted slightly faster so that
kswapd makes greater effort to reclaim enough pages to avoid the
fragmentation event.
This stall is not guaranteed to avoid serious fragmentation events. If
memory pressure is high enough, the pages freed by kswapd may be
reallocated or the free pages may not be in pageblocks that contain only
movable pages. Furthermore an allocation request that cannot stall (e.g.
atomic allocations) or unmovable/reclaimable allocations will still
proceed without stalling. The reason is that movable allocations can be
migrated and stalling for kswapd to make progress means that compaction
has targets. Unmovable/reclaimable allocations on the other hand do not
benefit from stalling as their pages cannot move.
The worst-case scenario for stalling is a combination of both high memory
pressure where kswapd is having trouble keeping free pages over the
pfmemalloc_reserve and movable allocations are fragmenting memory. In
this case, an allocation request may sleep for longer. There are both
vmstats to identify that stalls are happening and a tracepoint to quantify
what the stall durations are. Note that the granularity of the stall
detection is a jiffy, so the delay accounting is not precise.
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
Fragmentation events are further reduced. Note that in previous versions
they were reduced to negligible levels, but the logic has been corrected to
avoid excessive reclaim and slab shrinkage in the meantime, to avoid IO
regressions that may not be tolerable.
The latencies and allocation success rates are roughly similar. Over the
course of 16 minutes, there were 2 stalls due to fragmentation avoidance
for 8 microseconds.
The fragmentation events were increased which is bad, but this is offset
by the fact that THP allocation rates had a lower latency and a perfect
allocation success rate. There were 102 stalls over the course of 16
minutes for a total stall time of roughly 0.4 seconds.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
There is a slight reduction in fragmentation events but it's slight enough
that it may be due to luck. Unfortunately, both the latencies and success
rates were lower. However, this is highly likely to be due to luck given
that there were just 12 stalls for 76 microseconds. Direct reclaim was
also eliminated but that is likely a coincidence.
The fragmentation events are slightly reduced with the latencies and
allocation success rates much improved. There were 462 stalls over the
course of 68 minutes with a total stall time of roughly 1.9 seconds.
This patch has a marginal effect on fragmentation rates as it's rare for the
stall logic to actually trigger, but the small stalls can be enough for
kswapd to catch up. How much that helps is variable but probably
worthwhile for long-term allocation success rates. With this patch it is
possible to eliminate fragmentation events entirely with tuning, although
that would require careful evaluation to determine whether it's worthwhile.
Link: http://lkml.kernel.org/r/20181123114528.28802-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:44 +0000 (11:13 +1100)]
mm: reclaim small amounts of memory when an external fragmentation event occurs
An external fragmentation event was previously described as:
When the page allocator fragments memory, it records the event using
the mm_page_alloc_extfrag event. If the fallback_order is smaller
than a pageblock order (order-9 on 64-bit x86) then it's considered
an event that will cause external fragmentation issues in the future.
The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.
This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation-causing
event occurs. The boosting will stall allocations that would decrease
free memory below the boosted low watermark, and kswapd is woken, if the
calling context allows it, to reclaim an amount of memory relative to the
size of the high watermark and the watermark_boost_factor until the boost is
cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
from reclaim context during this operation to avoid excessive system
disruption in the name of fragmentation avoidance. Care is taken so that
kswapd will do normal reclaim work if the system is really low on memory.
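A sketch of the boost applied on an external fragmentation event (names follow the mainline version of this series; treat as illustrative):
	static void boost_watermark(struct zone *zone)
	{
		unsigned long max_boost;

		if (!watermark_boost_factor)
			return;

		/* the boost is capped at a fraction of the high watermark ... */
		max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
				      watermark_boost_factor, 10000);
		max_boost = max(pageblock_nr_pages, max_boost);

		/* ... and grows by one pageblock per fragmentation event */
		zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
					    max_boost);
	}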
This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
Note that external fragmentation-causing events are massively reduced by
this patch in comparison to both the previous kernel and the vanilla
kernel. The fault latency for huge pages appears to be increased, but that
is only because THP allocations were successful with the patch applied.
There is a 93% reduction in fragmentation causing events, there is a big
reduction in the huge page fault latency and allocation success rate is
higher.
There is a large reduction in fragmentation events with some jitter around
the latencies and success rates. As before, the high THP allocation
success rate does mean the system is under a lot of pressure. However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.
Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:43 +0000 (11:13 +1100)]
mm: Use alloc_flags to record if kswapd can wake -fix
Vlastimil Babka correctly pointed out that the ALLOC_KSWAPD flag needs to
be applied in the !CONFIG_ZONE_DMA32 case. This is a fix for the mmotm
patch mm-use-alloc_flags-to-record-if-kswapd-can-wake.patch
Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:43 +0000 (11:13 +1100)]
mm: use alloc_flags to record if kswapd can wake
This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
into alloc_flags; it exists only to avoid having to pass gfp_mask through a
long callchain in a future patch.
Note that the setting in the fast path happens in alloc_flags_nofragment()
and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
That's true in this patch but is not true later so it's done now for
easier review to show where the flag needs to be recorded.
No functional change.
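The gist is a one-line translation of the GFP flag into an internal alloc_flags bit, roughly (a sketch of the idea; flag names as in the final series):
	/* remember in alloc_flags whether this allocation may wake kswapd */
	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
		alloc_flags |= ALLOC_KSWAPD;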
Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:43 +0000 (11:13 +1100)]
mm: move zone watermark accesses behind an accessor
This is a preparation patch only, no functional change.
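A minimal sketch of the idea: callers stop reading zone->watermark[] directly and go through a helper, so a later patch can fold a boost term in at a single place (macro and field names as in the final series; treat as illustrative):
	/* all watermark reads funnel through one accessor */
	#define wmark_pages(z, i)	(z->_watermark[i])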
Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 5 Dec 2018 00:13:43 +0000 (11:13 +1100)]
mm, page_alloc: spread allocations across zones before introducing fragmentation
Patch series "Fragmentation avoidance improvements", v5.
It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is not perfect. Given sufficient time or an adverse
workload, memory gets fragmented and the long-term success of high-order
allocations degrades. This series defines an adverse workload, a definition
of external fragmentation events (including serious) ones and a series
that reduces the level of those fragmentation events.
The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.
The broad details of the workload are as follows;
1. Create an XFS filesystem (not specified in the configuration but done
as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameter create_on_open=1) and fallocate
is not used (fallocate=none). With multiple IO issuers this creates
a mix of slab and page cache allocations over time. The total size
of the files is 150% physical memory so that the slabs and page cache
pages get mixed
3. Warm up a number of fio read-only threads accessing the same files
created in step 2. This part runs for the same length of time it
took to create the files. It'll fault back in old data and further
interleave slab and page cache allocations. As it's now low on
memory due to step 2, fragmentation occurs as pageblocks get
stolen.
4. While step 3 is still running, start a process that tries to allocate
75% of memory as huge pages with a number of threads. The number of
threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
threads contending with fio, any other threads or forcing cross-NUMA
scheduling. Note that the test has not been used on a machine with less
than 8 cores. The benchmark records whether huge pages were allocated
and what the fault latency was in microseconds
5. Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
6. Cleanup
Overall the series reduces external fragmentation-causing events by over 94%
on 1- and 2-socket machines, which in turn impacts high-order allocation
success rates over the long term. There are differences in latencies and
high-order allocation success rates. Latencies are a mixed bag as they
are vulnerable to the exact system state and to whether allocations
succeeded, so they are treated as a secondary metric.
Patch 1 uses lower zones if they are populated and have free memory
instead of fragmenting a higher zone. It's special cased to
handle a Normal->DMA32 fallback with the reasons explained
in the changelog.
Patches 2-4 boost watermarks temporarily when an external fragmentation
event occurs. kswapd wakes to reclaim a small amount of old memory
and then wakes kcompactd on completion to recover the system
slightly. This introduces some overhead in the slowpath. The level
of boosting can be tuned or disabled depending on the tolerance
for fragmentation vs allocation latency.
Patch 5 stalls some movable allocation requests to let kswapd from patch 4
make some progress. The duration of the stalls is very low but it
is possible to tune the system to avoid fragmentation events if
larger stalls can be tolerated.
The bulk of the improvement in fragmentation avoidance is from patches
1-4 but patch 5 can deal with a rare corner case and provides the option
of tuning a system for THP allocation success rates in exchange for
some stalls to control fragmentation.
This patch (of 5):
The page allocator zone lists are iterated based on the watermarks of each
zone which does not take anti-fragmentation into account. On x86, node 0
may have multiple zones while other nodes have one zone. A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory. This patch special cases the
allocator fast path such that it'll try an allocation from a lower local
zone before fragmenting a higher zone. In this case, stealing of
pageblocks or orders larger than a pageblock are still allowed in the fast
path as they are uninteresting from a fragmentation point of view.
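A simplified sketch of the fast-path special case (condensed from the mainline version of this patch; only the ZONE_NORMAL -> ZONE_DMA32 fallback is handled, and the goto/error structure is omitted):
	static inline unsigned int alloc_flags_nofragment(struct zone *zone)
	{
	#ifdef CONFIG_ZONE_DMA32
		/* only the Normal->DMA32 fallback is special cased */
		if (zone_idx(zone) != ZONE_NORMAL)
			return 0;

		/*
		 * If ZONE_DMA32 exists on this node and is populated, prefer
		 * spilling movable allocations into it over fragmenting
		 * ZONE_NORMAL; DMA32 sits immediately before NORMAL in node_zones.
		 */
		BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
		if (nr_online_nodes > 1 && !populated_zone(--zone))
			return 0;

		return ALLOC_NOFRAGMENT;
	#else
		return 0;
	#endif
	}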
This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations. It's implemented in mmtests as the following
configurations
e.g. from mmtests
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
The broad details of the workload are as follows;
1. Create an XFS filesystem (not specified in the configuration but done
as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently.
Inefficiently means that files are created on first access and not
created in advance (fio parameter create_on_open=1) and fallocate
is not used (fallocate=none). With multiple IO issuers this creates
a mix of slab and page cache allocations over time. The total size
of the files is 150% physical memory so that the slabs and page cache
pages get mixed.
3. Warm up a number of fio read-only processes accessing the same files
created in step 2. This part runs for the same length of time it
took to create the files. It'll refault old data and further
interleave slab and page cache allocations. As it's now low on
memory due to step 2, fragmentation occurs as pageblocks get
stolen.
4. While step 3 is still running, start a process that tries to allocate
75% of memory as huge pages with a number of threads. The number of
threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
threads contending with fio, any other threads or forcing cross-NUMA
scheduling. Note that the test has not been used on a machine with less
than 8 cores. The benchmark records whether huge pages were allocated
and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
the fault latency and the huge page allocation success rate.
6. Cleanup the test files.
Note that due to the use of IO and page cache, this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation, it's not the only one. Workloads that are more
slab-intensive, or configurations where SLUB is used with high-order pages,
may yield different results.
When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
a pageblock order (order-9 on 64-bit x86) then it's considered to be an
"external fragmentation event" that may cause issues in the future.
Hence, the primary metric here is the number of external fragmentation
events that occur with order < 9. The secondary metrics are allocation
latency and huge page allocation success rates, but note that differences
in latencies and success rates can also affect the number of external
fragmentation events, which is why they are treated as secondary metrics.
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
Fault latencies are slightly reduced while allocation success rates remain
at zero as this configuration does not make any special effort to allocate
THP and fio is heavily active at the time and either filling memory or
keeping pages resident. However, a 49% reduction of serious fragmentation
events reduces the chances of external fragmentation being a problem in
the future.
Vlastimil asked during review for a breakdown of the allocation types
that are falling back.
The majority of the fallbacks are due to movable allocations, and this is
consistent for the workload throughout the series, so it will not be
presented again; movable allocations remain the primary source of fallbacks.
Movable fallbacks are sometimes considered "ok" because the pages can be
migrated. The problem is that they can fill an unmovable/reclaimable
pageblock, causing those allocations to fall back later and polluting
pageblocks with pages that cannot move. If there is a movable fallback, it
is pretty much guaranteed to affect an unmovable/reclaimable pageblock, and
while it might not be enough to actually cause an unmovable/reclaimable
fallback in the future, we cannot know that in advance, so the patch takes
the only option available to it. Hence, it's important to control them.
This point is also consistent throughout the series and will not be
repeated.
Fragmentation events were reduced quite a bit although this is known
to be a little variable. The latencies and allocation success rates
are similar but they were already quite high.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
The reduction of external fragmentation events is slight and this is
partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
allocations can now spill over to remote nodes instead of fragmenting
local memory.
There was a slight reduction in external fragmentation events although the
latencies were higher. The allocation success rate is high enough that
the system is struggling and there is quite a lot of parallel reclaim and
compaction activity. There is also a certain degree of luck on whether
processes start on node 0 or not for this patch but the relevance is
reduced later in the series.
Overall, the patch reduces the number of external fragmentation-causing
events, so the success of THP over long periods of time would be improved
for this adverse workload.
Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Wed, 5 Dec 2018 00:13:42 +0000 (11:13 +1100)]
mm/memory_hotplug: drop "online" parameter from add_memory_resource()
Userspace should always be in charge of how to online memory and whether
memory should be onlined automatically in the kernel. Let's drop the
parameter used to override this - XEN passes memhp_auto_online, just like
add_memory(), so we can directly use that internally instead.
Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Arun KS <arunks@codeaurora.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Wei Yang [Wed, 5 Dec 2018 00:13:42 +0000 (11:13 +1100)]
drivers/base/memory.c: remove an unnecessary check on NR_MEM_SECTIONS
In cb5e39b8038b ("drivers: base: refactor add_memory_section() to
add_memory_block()"), add_memory_block() is introduced, which is only
invoked in memory_dev_init().
When combining the two loops in memory_dev_init() and
add_memory_block(), they look like this:
for (i = 0; i < NR_MEM_SECTIONS; i += sections_per_block)
for (j = i;
(j < i + sections_per_block) && j < NR_MEM_SECTIONS;
j++)
Since (i < NR_MEM_SECTIONS) is guaranteed and j stays within its own memory
block, the check of (j < NR_MEM_SECTIONS) is not necessary.
This patch just removes this check.
Link: http://lkml.kernel.org/r/20181123222811.18216-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Seth Jennings <sjenning@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mike Rapoport [Wed, 5 Dec 2018 00:13:42 +0000 (11:13 +1100)]
memblock: replace usage of __memblock_free_early() with memblock_free()
__memblock_free_early() is only used by the convenience wrappers, so
essentially we wrap a call to memblock_free() twice. Replace calls of
__memblock_free_early() with calls to memblock_free() and drop the former.
Link: http://lkml.kernel.org/r/20181125102940.GE28634@rapoport-lnx Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Wentao Wang <witallwang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>