Manfred Spraul [Wed, 2 Jun 2021 03:53:16 +0000 (13:53 +1000)]
ipc/util.c: use binary search for max_idx
If semctl(), msgctl() and shmctl() are called with IPC_INFO, SEM_INFO,
MSG_INFO or SHM_INFO, then the return value is the index of the highest
used index in the kernel's internal array recording information about all
SysV objects of the requested type for the current namespace. (This
information can be used with repeated ..._STAT or ..._STAT_ANY operations
to obtain information about all SysV objects on the system.)
There is a cache for this value. But when the cache needs up be updated,
then the highest used index is determined by looping over all possible
values. With the introduction of IPCMNI_EXTEND_SHIFT, this could be a
loop over 16 million entries. And due to /proc/sys/kernel/*next_id, the
index values do not need to be consecutive.
With <write 16000000 to msg_next_id>, msgget(), msgctl(,IPC_RMID) in a
loop, I have observed a performance increase of around factor 13000.
As there is no get_last() function for idr structures: Implement a
"get_last()" using a binary search.
As far as I see, ipc is the only user that needs get_last(), thus
implement it in ipc/util.c and not in a central location.
Link: https://lkml.kernel.org/r/20210425075208.11777-2-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: <1vier1@web.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vasily Averin [Wed, 2 Jun 2021 03:53:16 +0000 (13:53 +1000)]
ipc: use kmalloc for msg_queue and shmid_kernel
msg_queue and shmid_kernel are quite small objects, no need to use
kvmalloc for them. mhocko@: "Both of them are 256B on most 64b systems."
Previously these objects was allocated via ipc_alloc/ipc_rcu_alloc(),
common function for several ipc objects. It had kvmalloc call inside().
Later, this function went away and was finally replaced by direct kvmalloc
call, and now we can use more suitable kmalloc/kfree for them.
Link: https://lkml.kernel.org/r/0d0b6c9b-8af3-29d8-34e2-a565c53780f3@virtuozzo.com Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Vasily Averin [Wed, 2 Jun 2021 03:53:16 +0000 (13:53 +1000)]
ipc sem: use kvmalloc for sem_undo allocation
Patch series "ipc: allocations cleanup", v2.
Some ipc objects use the wrong allocation functions: small objects can use
kmalloc(), and vice versa, potentially large objects can use kmalloc().
This patch (of 2):
Size of sem_undo can exceed one page and with the maximum possible nsems =
32000 it can grow up to 64Kb. Let's switch its allocation to kvmalloc to
avoid user-triggered disruptive actions like OOM killer in case of
high-order memory shortage.
User triggerable high order allocations are quite a problem on heavily
fragmented systems. They can be a DoS vector.
Link: https://lkml.kernel.org/r/ebc3ac79-3190-520d-81ce-22ad194986ec@virtuozzo.com Link: https://lkml.kernel.org/r/a6354fd9-2d55-2e63-dd4d-fa7dc1d11134@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Marco Elver [Wed, 2 Jun 2021 03:53:15 +0000 (13:53 +1000)]
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
add comment explaining __has_feature() in Clang
Link: https://lkml.kernel.org/r/20210527194448.3470080-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Marco Elver [Wed, 2 Jun 2021 03:53:15 +0000 (13:53 +1000)]
kcov: add __no_sanitize_coverage to fix noinstr for all architectures
Until now no compiler supported an attribute to disable coverage
instrumentation as used by KCOV.
To work around this limitation on x86, noinstr functions have their
coverage instrumentation turned into nops by objtool. However, this
solution doesn't scale automatically to other architectures, such as
arm64, which are migrating to use the generic entry code.
Add __no_sanitize_coverage for both compilers, and add it to noinstr.
Note: In the Clang case, __has_feature(coverage_sanitizer) is only true if
the feature is enabled, and therefore we do not require an additional
defined(CONFIG_KCOV) (like in the GCC case where __has_attribute(..) is
always true) to avoid adding redundant attributes to functions if KCOV is
off. That being said, compilers that support the attribute will not
generate errors/warnings if the attribute is redundantly used; however,
where possible let's avoid it as it reduces preprocessed code size and
associated compile-time overheads.
Link: https://lkml.kernel.org/r/20210525175819.699786-1-elver@google.com Signed-off-by: Marco Elver <elver@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Arvind Sankar <nivedita@alum.mit.edu> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Al Viro [Wed, 2 Jun 2021 03:53:14 +0000 (13:53 +1000)]
x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
Currently we handle SS_AUTODISARM as soon as we have stored the altstack
settings into sigframe - that's the point when we have set the things up
for eventual sigreturn to restore the old settings. And if we manage to
set the sigframe up (we are not done with that yet), everything's fine.
However, in case of failure we end up with sigframe-to-be abandoned and
SIGSEGV force-delivered. And in that case we end up with inconsistent
rules - late failures have altstack reset, early ones do not.
It's trivial to get consistent behaviour - just handle SS_AUTODISARM once
we have set the sigframe up and are committed to entering the handler,
i.e. in signal_delivered().
Gustavo A. R. Silva [Wed, 2 Jun 2021 03:53:14 +0000 (13:53 +1000)]
hfsplus: fix out-of-bounds warnings in __hfsplus_setxattr
Fix the following out-of-bounds warnings by enclosing structure members
file and finder into new struct info:
fs/hfsplus/xattr.c:300:5: warning: 'memcpy' offset [65, 80] from the object at 'entry' is out of the bounds of referenced subobject 'user_info' with type 'struct DInfo' at offset 48 [-Warray-bounds]
fs/hfsplus/xattr.c:313:5: warning: 'memcpy' offset [65, 80] from the object at 'entry' is out of the bounds of referenced subobject 'user_info' with type 'struct FInfo' at offset 48 [-Warray-bounds]
Refactor the code by making it more "structured."
Also, this helps with the ongoing efforts to enable -Warray-bounds and
makes the code clearer and avoid confusing the compiler.
Matthew said:
: The offending line is this:
:
: - memcpy(&entry.file.user_info, value,
: + memcpy(&entry.file.info, value,
: file_finderinfo_len);
:
: what it's trying to do is copy two structs which are adjacent to each
: other in a single call to memcpy(). gcc legitimately complains that
: the memcpy to this struct overruns the bounds of the struct. What
: Gustavo has done here is introduce a new struct that contains the two
: structs, and now gcc is happy that the memcpy doesn't overrun the
: length of this containing struct.
Andrew Halaney [Wed, 2 Jun 2021 03:53:14 +0000 (13:53 +1000)]
init/main.c: silence some -Wunused-parameter warnings
There are a bunch of callbacks with unused arguments, go ahead and silence
those so "make KCFLAGS=-W init/main.o" is a little quieter. Here's a
little sample:
Andrew Halaney [Wed, 2 Jun 2021 03:53:14 +0000 (13:53 +1000)]
init: print out unknown kernel parameters
It is easy to foobar setting a kernel parameter on the command line
without realizing it, there's not much output that you can use to assess
what the kernel did with that parameter by default.
Make it a little more explicit which parameters on the command line
_looked_ like a valid parameter for the kernel, but did not match anything
and ultimately got tossed to init. This is very similar to the unknown
parameter message received when loading a module.
This assumes the parameters are processed in a normal fashion, some
parameters (dyndbg= for example) don't register their parameter with the
rest of the kernel's parameters, and therefore always show up in this list
(and are also given to init - like the rest of this list).
Another example is BOOT_IMAGE= is highlighted as an offender, which it
technically is, but is passed by LILO and GRUB so most systems will see
that complaint.
An example output where "foobared" and "unrecognized" are intentionally
invalid parameters:
Guenter Roeck [Wed, 2 Jun 2021 03:53:14 +0000 (13:53 +1000)]
checkpatch: scripts/spdxcheck.py now requires python3
Since commit d0259c42abff ("spdxcheck.py: Use Python 3"), spdxcheck.py
explicitly expects to run as python3 script. If "python" still points to
python v2.7 and the script is executed with "python scripts/spdxcheck.py",
the following error may be seen even if git-python is installed for
python3.
Traceback (most recent call last):
File "scripts/spdxcheck.py", line 10, in <module>
import git
ImportError: No module named git
To fix the problem, check for the existence of python3, check if
the script is executable and not just for its existence, and execute
it directly.
Link: https://lkml.kernel.org/r/20210505211720.447111-1-linux@roeck-us.net Signed-off-by: Guenter Roeck <linux@roeck-us.net> Cc: Joe Perches <joe@perches.com> Cc: Bert Vermeulen <bert@biot.com> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Bert Vermeulen <bert@biot.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Dimitri John Ledkov [Wed, 2 Jun 2021 03:53:13 +0000 (13:53 +1000)]
lib/decompress_unlz4.c: correctly handle zero-padding around initrds.
lz4 compatible decompressor is simple. The format is underspecified and
relies on EOF notification to determine when to stop. Initramfs buffer
format[1] explicitly states that it can have arbitrary number of zero
padding. Thus when operating without a fill function, be extra careful to
ensure that sizes less than 4, or apperantly empty chunksizes are treated
as EOF.
To test this I have created two cpio initrds, first a normal one,
main.cpio. And second one with just a single /test-file with content
"second" second.cpio. Then i compressed both of them with gzip, and with
lz4 -l. Then I created a padding of 4 bytes (dd if=/dev/zero of=pad4 bs=1
count=4). To create four testcase initrds:
The pad4 test-cases replicate the initrd load by grub, as it pads and
aligns every initrd it loads.
All of the above boot, however /test-file was not accessible in the initrd
for the testcase #4, as decoding in lz4 decompressor failed. Also an
error message printed which usually is harmless.
Whith a patched kernel, all of the above testcases now pass, and
/test-file is accessible.
This fixes lz4 initrd decompress warning on every boot with grub. And
more importantly this fixes inability to load multiple lz4 compressed
initrds with grub. This patch has been shipping in Ubuntu kernels since
January 2021.
Trent Piepho [Wed, 2 Jun 2021 03:53:12 +0000 (13:53 +1000)]
lib/math/rational.c: fix divide by zero
If the input is out of the range of the allowed values, either larger than
the largest value or closer to zero than the smallest non-zero allowed
value, then a division by zero would occur.
In the case of input too large, the division by zero will occur on the
first iteration. The best result (largest allowed value) will be found by
always choosing the semi-convergent and excluding the denominator based
limit when finding it.
In the case of the input too small, the division by zero will occur on the
second iteration. The numerator based semi-convergent should not be
calculated to avoid the division by zero. But the semi-convergent vs
previous convergent test is still needed, which effectively chooses
between 0 (the previous convergent) vs the smallest allowed fraction (best
semi-convergent) as the result.
Link: https://lkml.kernel.org/r/20210525144250.214670-1-tpiepho@gmail.com Fixes: 323dd2c3ed0 ("lib/math/rational.c: fix possible incorrect result from rational fractions helper") Signed-off-by: Trent Piepho <tpiepho@gmail.com> Reported-by: Yiyuan Guo <yguoaz@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Oskar Schirmer <oskar@scara.com> Cc: Daniel Latypov <dlatypov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:12 +0000 (13:53 +1000)]
seq_file: drop unused *_escape_mem_ascii()
There are no more users of the seq_escape_mem_ascii() followed by
string_escape_mem_ascii().
Remove them for good.
Link: https://lkml.kernel.org/r/20210504180819.73127-16-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:12 +0000 (13:53 +1000)]
nfsd: avoid non-flexible API in seq_quote_mem()
The seq_escape_mem_ascii() is completely non-flexible and shouldn't be
used. Replace it with properly called seq_escape_mem().
Link: https://lkml.kernel.org/r/20210504180819.73127-15-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:12 +0000 (13:53 +1000)]
seq_file: convert seq_escape() to use seq_escape_str()
Convert seq_escape() to use seq_escape_str() rather than open coding it.
Note, for now we leave it as an exported symbol due to some old code that
can't tolerate ctype.h being (indirectly) included.
Link: https://lkml.kernel.org/r/20210504180819.73127-14-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:12 +0000 (13:53 +1000)]
seq_file: add seq_escape_str() as replica of string_escape_str()
In some cases we want to escape characters from NULL-terminated strings.
Add seq_escape_str() as replica of string_escape_str() for that.
Link: https://lkml.kernel.org/r/20210504180819.73127-13-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:11 +0000 (13:53 +1000)]
seq_file: introduce seq_escape_mem()
Introduce seq_escape_mem() to allow users to pass additional parameters to
string_escape_mem().
Link: https://lkml.kernel.org/r/20210504180819.73127-12-andriy.shevchenko@linux.intel.com Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:11 +0000 (13:53 +1000)]
MAINTAINERS: add myself as designated reviewer for generic string library
Add myself as designated reviewer for generic string library.
Link: https://lkml.kernel.org/r/20210504180819.73127-11-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:11 +0000 (13:53 +1000)]
lib/test-string_helpers: add test cases for new features
We have got new flags and hence new features of string_escape_mem().
Add test cases for that.
Link: https://lkml.kernel.org/r/20210504180819.73127-10-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:11 +0000 (13:53 +1000)]
lib/test-string_helpers: get rid of trailing comma in terminators
Terminators by definition shouldn't accept anything behind. Make them
robust by removing trailing commas.
Link: https://lkml.kernel.org/r/20210504180819.73127-9-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:11 +0000 (13:53 +1000)]
lib/test-string_helpers: print flags in hexadecimal format
Since flags are bitmapped, it's better to print them in hexadecimal
format.
Link: https://lkml.kernel.org/r/20210504180819.73127-8-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:10 +0000 (13:53 +1000)]
lib/string_helpers: allow to append additional characters to be escaped
Introduce a new flag to append additional characters, passed in 'only'
parameter, to be escaped if they fall in the corresponding class.
Link: https://lkml.kernel.org/r/20210504180819.73127-7-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:10 +0000 (13:53 +1000)]
lib/string_helpers: introduce ESCAPE_NAP to escape non-ASCII and non-printable
Some users may want to have an ASCII based filter for printable only
characters, provided by conjunction of isascii() and isprint() functions.
Here is the addition of a such.
Link: https://lkml.kernel.org/r/20210504180819.73127-6-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:10 +0000 (13:53 +1000)]
lib/string_helpers: introduce ESCAPE_NA for escaping non-ASCII
Some users may want to have an ASCII based filter, provided by isascii()
function. Here is the addition of a such.
Link: https://lkml.kernel.org/r/20210504180819.73127-5-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:10 +0000 (13:53 +1000)]
lib/string_helpers: drop indentation level in string_escape_mem()
The only one conditional is left on the upper level, move the rest to the
same level and drop indentation level. No functional changes.
Link: https://lkml.kernel.org/r/20210504180819.73127-4-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:10 +0000 (13:53 +1000)]
lib/string_helpers: move ESCAPE_NP check inside 'else' branch in a loop
Refactor code to have better readability by moving ESCAPE_NP handling
inside 'else' branch in the loop.
No functional change intended.
Link: https://lkml.kernel.org/r/20210504180819.73127-3-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:09 +0000 (13:53 +1000)]
lib/string_helpers: switch to use BIT() macro
Patch series "lib/string_helpers: get rid of ugly *_escape_mem_ascii()", v3.
Get rid of ugly *_escape_mem_ascii() API since it's not flexible and has
the only single user. Provide better approach based on usage of the
string_escape_mem() with appropriate flags.
Test cases has been expanded accordingly to cover new functionality.
This patch (of 15):
Switch to use BIT() macro for flag definitions. No changes implied.
Andy Shevchenko [Wed, 2 Jun 2021 03:53:09 +0000 (13:53 +1000)]
kernel.h: split out panic and oops helpers (ia64 fix)
Follow up fix for the "kernel.h: split out panic and oops helpers" patch.
Link: https://lkml.kernel.org/r/20210520130557.55277-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reported-by: kernel test robot <lkp@intel.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 2 Jun 2021 03:53:09 +0000 (13:53 +1000)]
kernelh-split-out-panic-and-oops-helpers-fix
thread_info.h needs limits.h
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andy Shevchenko [Wed, 2 Jun 2021 03:53:09 +0000 (13:53 +1000)]
kernel.h: split out panic and oops helpers
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out panic and
oops helpers.
There are several purposes of doing this:
- dropping dependency in bug.h
- dropping a loop by moving out panic_notifier.h
- unload kernel.h from something which has its own domain
At the same time convert users tree-wide to use new headers, although for
the time being include new header back to kernel.h to avoid twisted
indirected includes for existing users.
Link: https://lkml.kernel.org/r/20210511074137.33666-1-andriy.shevchenko@linux.intel.com Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org> Co-developed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Sebastian Reichel <sre@kernel.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Stephen Boyd <sboyd@kernel.org> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Julius Hemanth Pitti [Wed, 2 Jun 2021 03:53:08 +0000 (13:53 +1000)]
proc/sysctl: make protected_* world readable
protected_* files have 600 permissions which prevents non-superuser from
reading them.
Container like "AWS greengrass" refuse to launch unless
protected_hardlinks and protected_symlinks are set. When containers like
these run with "userns-remap" or "--user" mapping container's root to
non-superuser on host, they fail to run due to denied read access to these
files.
As these protections are hardly a secret, and do not possess any security
risk, making them world readable.
Though above greengrass usecase needs read access to only
protected_hardlinks and protected_symlinks files, setting all other
protected_* files to 644 to keep consistency.
Link: http://lkml.kernel.org/r/20200709235115.56954-1-jpitti@cisco.com Fixes: 800179c9b8a1 ("fs: add link restrictions") Signed-off-by: Julius Hemanth Pitti <jpitti@cisco.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Kalesh Singh [Wed, 2 Jun 2021 03:53:08 +0000 (13:53 +1000)]
procfs/dmabuf: add inode number to /proc/*/fdinfo
And 'ino' field to /proc/<pid>/fdinfo/<FD> and
/proc/<pid>/task/<tid>/fdinfo/<FD>.
The inode numbers can be used to uniquely identify DMA buffers in user
space and avoids a dependency on /proc/<pid>/fd/* when accounting
per-process DMA buffer sizes.
Link: https://lkml.kernel.org/r/20210308170651.919148-2-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Christian König <christian.koenig@amd.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hridya Valsaraju <hridya@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Cc: Szabolcs Nagy <szabolcs.nagy@arm.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: Andrei Vagin <avagin@gmail.com> Cc: Helge Deller <deller@gmx.de> Cc: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Kalesh Singh [Wed, 2 Jun 2021 03:53:08 +0000 (13:53 +1000)]
procfs: allow reading fdinfo with PTRACE_MODE_READ
Android captures per-process system memory state when certain low memory
events (e.g a foreground app kill) occur, to identify potential memory
hoggers. In order to measure how much memory a process actually consumes,
it is necessary to include the DMA buffer sizes for that process in the
memory accounting. Since the handle to DMA buffers are raw FDs, it is
important to be able to identify which processes have FD references to a
DMA buffer.
Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and
/proc/<pid>/fdinfo -- both are only readable by the process owner, as
follows:
1. Do a readlink on each FD.
2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD.
3. stat the file to get the dmabuf inode number.
4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size.
Accessing other processes' fdinfo requires root privileges. This limits
the use of the interface to debugging environments and is not suitable for
production builds. Granting root privileges even to a system process
increases the attack surface and is highly undesirable.
Since fdinfo doesn't permit reading process memory and manipulating
process state, allow accessing fdinfo under PTRACE_MODE_READ_FSCRED.
Link: https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com Signed-off-by: Kalesh Singh <kaleshsingh@google.com> Suggested-by: Jann Horn <jannh@google.com> Acked-by: Christian König <christian.koenig@amd.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Helge Deller <deller@gmx.de> Cc: Hridya Valsaraju <hridya@google.com> Cc: James Morris <jamorris@linux.microsoft.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Szabolcs Nagy <szabolcs.nagy@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
ZHOUFENG [Wed, 2 Jun 2021 03:53:07 +0000 (13:53 +1000)]
fs/proc/kcore.c: add mmap interface
When we do the kernel monitor, use the DRGN
(https://github.com/osandov/drgn) access to kernel data structures, found
that the system calls a lot. DRGN is implemented by reading /proc/kcore.
After looking at the kcore code, it is found that kcore does not implement
mmap, resulting in frequent context switching triggered by read.
Therefore, we want to add mmap interface to optimize performance. Since
vmalloc and module areas will change with allocation and release,
consistency cannot be guaranteed, so mmap interface only maps KCORE_TEXT
and KCORE_RAM.
The test results:
1. the default version of kcore
real 11.00
user 8.53
sys 3.59
Use size_t when capping the count argument received by mem_rw(). Since
count is size_t, using min_t(int, ...) can lead to a negative value
that will later be passed to access_remote_vm(), which can cause
unexpected behavior.
Since we are capping the value to at maximum PAGE_SIZE, the conversion
from size_t to int when passing it to access_remote_vm() as "len"
shouldn't be a problem.
Link: https://lkml.kernel.org/r/20210512125215.3348316-1-marcelo.cerri@canonical.com Reviewed-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Souza Cascardo <cascardo@canonical.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tetsuo Handa [Wed, 2 Jun 2021 03:53:07 +0000 (13:53 +1000)]
kernel/hung_task.c: Monitor killed tasks.
syzbot's current top report is "no output from test machine" where the
userspace process failed to spawn a new test process for 300 seconds for
some reason. One of reasons which can result in this report is that an
already spawned test process was unable to terminate (e.g. trapped at an
unkillable retry loop due to some bug) after SIGKILL was sent to that
process. Therefore, reporting when a thread is failing to terminate
despite a fatal signal is pending would give us more useful information.
In the context of syzbot's testing where there are only 2 CPUs in the
target VM (which means that only small number of threads and not so much
memory) and threads get SIGKILL after 5 seconds from fork(), being unable
to reach do_exit() within 10 seconds is likely a sign of something went
wrong. Therefore, I would like to try this patch in linux-next.git for
feasibility testing whether this patch helps finding more bugs and
reproducers for such bugs, by bringing "unable to terminate threads"
reports out of "no output from test machine" reports.
Potential bad effect of this patch will be that kernel code becomes
killable without addressing the root cause of being unable to terminate,
for use of killable wait will bypass both TASK_UNINTERRUPTIBLE stall test
and SIGKILL after 5 seconds behavior, which will result in failing to
detect in real systems where SIGKILL won't be sent after 5 seconds when
something went wrong.
This version shares existing sysctl settings (e.g. check interval,
timeout, whether to panic) used for detecting TASK_UNINTERRUPTIBLE
threads. We will likely want to use different sysctl settings for
monitoring killed threads. But let's start as linux-next.git patch
without introducing new sysctl settings. We can add sysctl settings
before sending to linux.git.
Link: http://lkml.kernel.org/r/60d1d7f6-b201-3dcb-a51b-76a31bcfa919@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Liu Chuansheng <chuansheng.liu@intel.com> Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu> Cc: linux-kernel@vger.kernel.org Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Tetsuo Handa [Wed, 2 Jun 2021 03:53:06 +0000 (13:53 +1000)]
fs/buffer.c: add debug print for __getblk_gfp() stall problem
Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
__getblk_gfp() line in the backtrace. Since there is a comment block that
says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
attempt from grow_dev_page() is failing, let's start from checking whether
syzbot is hitting that case. This change will be removed after the bug is
fixed.
Link: http://lkml.kernel.org/r/9b9fcdda-c347-53ee-fdbb-8a7d11cf430e@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@redhat.com> Cc: <syzkaller-bugs@googlegroups.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Marco Elver [Wed, 2 Jun 2021 03:53:06 +0000 (13:53 +1000)]
kfence: unconditionally use unbound work queue
Unconditionally use unbound work queue, and not just if wq_power_efficient
is true. Because if the system is idle, KFENCE may wait, and by being run
on the unbound work queue, we permit the scheduler to make better
scheduling decisions and not require pinning KFENCE to the same CPU upon
waking up.
Link: https://lkml.kernel.org/r/20210521111630.472579-1-elver@google.com Fixes: 36f0b35d0894 ("kfence: use power-efficient work queue to run delayed work") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Hillf Danton <hdanton@sina.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:06 +0000 (13:53 +1000)]
mm/swap: make NODE_DATA an inline function on CONFIG_FLATMEM
make W=1 generates the following warning in mm/workingset.c for allnoconfig
mm/workingset.c: In function `unpack_shadow':
mm/workingset.c:201:15: warning: variable `nid' set but not used [-Wunused-but-set-variable]
int memcgid, nid;
^~~
On FLATMEM, NODE_DATA returns a global pglist_data without dereferencing
nid. Make the helper an inline function to suppress the warning, add type
checking and to apply any side-effects in the parameter list.
Link: https://lkml.kernel.org/r/20210520084809.8576-15-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:06 +0000 (13:53 +1000)]
mm/page_alloc: move prototype for find_suitable_fallback
make W=1 generates the following warning in mmap_lock.c for allnoconfig
mm/page_alloc.c:2670:5: warning: no previous prototype for `find_suitable_fallback' [-Wmissing-prototypes]
int find_suitable_fallback(struct free_area *area, unsigned int order,
^~~~~~~~~~~~~~~~~~~~~~
find_suitable_fallback is only shared outside of page_alloc.c for
CONFIG_COMPACTION but to suppress the warning, move the protype outside of
CONFIG_COMPACTION. It is not worth the effort at this time to find a
clever way of allowing compaction.c to share the code or avoid the use
entirely as the function is called on relatively slow paths.
Link: https://lkml.kernel.org/r/20210520084809.8576-14-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Bixuan Cui [Wed, 2 Jun 2021 03:53:06 +0000 (13:53 +1000)]
mm/mmap_lock: fix warning when CONFIG_TRACING is not defined
Fix the warning: [-Wunused-function]
mm/mmap_lock.c:157:20: warning: `get_mm_memcg_path' defined but not used
static const char *get_mm_memcg_path(struct mm_struct *mm)
^~~~~~~~~~~~~~~~~
Move get_mm_memcg_path() into #ifdef CONFIG_TRACING.
Link: https://lkml.kernel.org/r/20210531033426.74031-1-cuibixuan@huawei.com Signed-off-by: Bixuan Cui <cuibixuan@huawei.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yang Shi <shy828301@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:05 +0000 (13:53 +1000)]
mm/mmap_lock: remove dead code for !CONFIG_TRACING configurations
make W=1 generates the following warning in mmap_lock.c for allnoconfig
mm/mmap_lock.c:213:6: warning: no previous prototype for `__mmap_lock_do_trace_start_locking' [-Wmissing-prototypes]
void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/mmap_lock.c:219:6: warning: no previous prototype for `__mmap_lock_do_trace_acquire_returned' [-Wmissing-prototypes]
void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
mm/mmap_lock.c:226:6: warning: no previous prototype for `__mmap_lock_do_trace_released' [-Wmissing-prototypes]
void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
On !CONFIG_TRACING configurations, the code is dead so put it behind an
#ifdef.
Link: https://lkml.kernel.org/r/20210520084809.8576-13-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:05 +0000 (13:53 +1000)]
mm/swap: make swap_address_space an inline function
make W=1 generates the following warning in page_mapping() for allnoconfig
mm/util.c:700:15: warning: variable `entry' set but not used [-Wunused-but-set-variable]
swp_entry_t entry;
^~~~~
swap_address is a #define on !CONFIG_SWAP configurations. Make the helper
an inline function to suppress the warning, add type checking and to apply
any side-effects in the parameter list.
Link: https://lkml.kernel.org/r/20210520084809.8576-12-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:05 +0000 (13:53 +1000)]
mm/z3fold: add kerneldoc fields for z3fold_pool
make W=1 generates the following warning for z3fold_pool
mm/z3fold.c:171: warning: Function parameter or member 'zpool' not described in 'z3fold_pool'
mm/z3fold.c:171: warning: Function parameter or member 'zpool_ops' not described in 'z3fold_pool'
Commit 9a001fc19ccc ("z3fold: the 3-fold allocator for compressed pages")
simply did not document the fields at the time. Add rudimentary
documentation.
Link: https://lkml.kernel.org/r/20210520084809.8576-11-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:05 +0000 (13:53 +1000)]
mm/zbud: add kerneldoc fields for zbud_pool
make W=1 generates the following warning for zbud_pool
mm/zbud.c:105: warning: Function parameter or member 'zpool' not described in 'zbud_pool'
mm/zbud.c:105: warning: Function parameter or member 'zpool_ops' not described in 'zbud_pool'
Commit 479305fd7172 ("zpool: remove zpool_evict()") removed the
zpool_evict helper and added the associated zpool and operations structure
in struct zbud_pool but did not add documentation for the fields. Add
rudimentary documentation.
Link: https://lkml.kernel.org/r/20210520084809.8576-10-mgorman@techsingularity.net Fixes: 479305fd7172 ("zpool: remove zpool_evict()") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:05 +0000 (13:53 +1000)]
mm/memory_hotplug: fix kerneldoc comment for __remove_memory
make W=1 generates the following warning for __remove_memory
mm/memory_hotplug.c:2044: warning: expecting prototype for remove_memory(). Prototype was for __remove_memory() instead
Commit eca499ab3749 ("mm/hotplug: make remove_memory() interface usable")
introduced the kerneldoc comment and function but the kerneldoc name and
function name did not match.
Link: https://lkml.kernel.org/r/20210520084809.8576-9-mgorman@techsingularity.net Fixes: eca499ab3749 ("mm/hotplug: make remove_memory() interface usable") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:04 +0000 (13:53 +1000)]
mm/memory_hotplug: fix kerneldoc comment for __try_online_node
make W=1 generates the following warning for try_online_node
mm/memory_hotplug.c:1087: warning: expecting prototype for try_online_node(). Prototype was for __try_online_node() instead
Commit b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use
__try_online_node") renamed the function but did not update the associated
kerneldoc. The function is static and somewhat specialised in nature so
it's not clear it warrants being a kerneldoc by moving the comment to
try_online_node. Hence, leave the comment of the internal helper in place
but leave it out of kerneldoc and correct the function name in the
comment.
Link: https://lkml.kernel.org/r/20210520084809.8576-8-mgorman@techsingularity.net Fixes: Commit b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:04 +0000 (13:53 +1000)]
mm/memcontrol.c: fix kerneldoc comment for mem_cgroup_calculate_protection
make W=1 generates the following warning for mem_cgroup_calculate_protection
mm/memcontrol.c:6468: warning: expecting prototype for mem_cgroup_protected(). Prototype was for mem_cgroup_calculate_protection() instead
Commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from
protection checks") changed the function definition but not the associated
kerneldoc comment.
Link: https://lkml.kernel.org/r/20210520084809.8576-7-mgorman@techsingularity.net Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:04 +0000 (13:53 +1000)]
mm/mapping_dirty_helpers: remove double Note in kerneldoc
make W=1 generates the following warning for mm/mapping_dirty_helpers.c
mm/mapping_dirty_helpers.c:325: warning: duplicate section name 'Note'
The helper function is very specific to one driver -- vmwgfx. While the
two notes are separate, all of it needs to be taken into account when
using the helper so make it one note.
Link: https://lkml.kernel.org/r/20210520084809.8576-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:04 +0000 (13:53 +1000)]
mm/page_alloc: make should_fail_alloc_page() static
make W=1 generates the following warning for mm/page_alloc.c
mm/page_alloc.c:3651:15: warning: no previous prototype for `should_fail_alloc_page' [-Wmissing-prototypes]
noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
^~~~~~~~~~~~~~~~~~~~~~
This function is deliberately split out for BPF to allow errors to be
injected. The function is not used anywhere else so it is local to the
file. Make it static which should still allow error injection to be used
similar to how block/blk-core.c:should_fail_bio() works.
Link: https://lkml.kernel.org/r/20210520084809.8576-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:04 +0000 (13:53 +1000)]
mm/vmalloc: include header for prototype of set_iounmap_nonlazy
make W=1 generates the following warning for mm/vmalloc.c
mm/vmalloc.c:1599:6: warning: no previous prototype for `set_iounmap_nonlazy' [-Wmissing-prototypes]
void set_iounmap_nonlazy(void)
^~~~~~~~~~~~~~~~~~~
This is an arch-generic function only used by x86. On other arches, it's
dead code. Include the header with the definition and make it x86-64
specific.
Link: https://lkml.kernel.org/r/20210520084809.8576-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Streetman <ddstreet@ieee.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Mel Gorman [Wed, 2 Jun 2021 03:53:03 +0000 (13:53 +1000)]
mm/vmscan: remove kerneldoc-like comment from isolate_lru_pages
Patch series "Clean W=1 build warnings for mm/".
This is a janitorial only. During development of a tool to catch build
warnings early to avoid tripping the Intel lkp-robot, I noticed that mm/
is not clean for W=1. This is generally harmless but there is no harm in
cleaning it up. It disrupts git blame a little but on relatively obvious
lines that are unlikely to be git blame targets.
This patch (of 13):
make W=1 generates the following warning for vmscan.c
mm/vmscan.c:1814: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
It is not a kerneldoc comment and isolate_lru_pages() is a static
function. While the detailed comment is nice, it does not need to be
exposed via kernel-doc.
Zhen Lei [Wed, 2 Jun 2021 03:53:03 +0000 (13:53 +1000)]
mm: fix spelling mistakes
Fix some spelling mistakes in comments:
each having differents usage ==> each has a different usage
statments ==> statements
adresses ==> addresses
aggresive ==> aggressive
datas ==> data
posion ==> poison
higer ==> higher
precisly ==> precisely
wont ==> won't
We moves tha ==> We move the
endianess ==> endianness
Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anshuman Khandual [Wed, 2 Jun 2021 03:53:03 +0000 (13:53 +1000)]
mm: define default value for FIRST_USER_ADDRESS
Currently most platforms define FIRST_USER_ADDRESS as 0UL duplication the
same code all over. Instead just define a generic default value (i.e 0UL)
for FIRST_USER_ADDRESS and let the platforms override when required. This
makes it much cleaner with reduced code.
The default FIRST_USER_ADDRESS here would be skipped in <linux/pgtable.h>
when the given platform overrides its value via <asm/pgtable.h>.
Link: https://lkml.kernel.org/r/1620615725-24623-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Guo Ren <guoren@kernel.org> [csky] Acked-by: Stafford Horne <shorne@gmail.com> [openrisc] Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> [RISC-V] Cc: Richard Henderson <rth@twiddle.net> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Guo Ren <guoren@kernel.org> Cc: Brian Cain <bcain@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Jan Kara [Wed, 2 Jun 2021 03:53:03 +0000 (13:53 +1000)]
mm: fix comments mentioning i_mutex
inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
comments still mentioning i_mutex.
Link: https://lkml.kernel.org/r/20210429094141.12819-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yue Hu [Wed, 2 Jun 2021 03:53:02 +0000 (13:53 +1000)]
zram: move backing_dev under macro CONFIG_ZRAM_WRITEBACK
backing_dev is never used when not enable CONFIG_ZRAM_WRITEBACK and it's
introduced from writeback feature. So it's needless also affect
readability in that case.
Link: https://lkml.kernel.org/r/20210521060544.2385-1-zbestahu@gmail.com Signed-off-by: Yue Hu <huyue2@yulong.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Ira Weiny [Wed, 2 Jun 2021 03:53:02 +0000 (13:53 +1000)]
mm/highmem: Remove deprecated kmap_atomic
kmap_atomic() is being deprecated in favor of kmap_local_page().
Replace the uses of kmap_atomic() within the highmem code.
On profiling clear_huge_page() using ftrace an improvement of 62% was
observed on the below setup.
Setup:-
Below data has been collected on Qualcomm's SM7250 SoC THP enabled
(kernel v4.19.113) with only CPU-0(Cortex-A55) and CPU-7(Cortex-A76)
switched on and set to max frequency, also DDR set to perf governor.
FTRACE Data:-
Base data:-
Number of iterations: 48
Mean of allocation time: 349.5 us
std deviation: 74.5 us
v4 data:-
Number of iterations: 48
Mean of allocation time: 131 us
std deviation: 32.7 us
The following simple userspace experiment to allocate
100MB(BUF_SZ) of pages and writing to it gave us a good insight,
we observed an improvement of 42% in allocation and writing timings.
-------------------------------------------------------------
Test code snippet
-------------------------------------------------------------
clock_start();
buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */
Base data:-
Number of iterations: 100
Mean of allocation time: 31831 us
std deviation: 4286 us
v4 data:-
Number of iterations: 100
Mean of allocation time: 18193 us
std deviation: 4915 us
Link: https://lkml.kernel.org/r/20210204073255.20769-2-prathu.baronia@oneplus.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Wed, 2 Jun 2021 03:53:02 +0000 (13:53 +1000)]
mm/zswap.c: fix two bugs in zswap_writeback_entry()
In the ZSWAP_SWAPCACHE_FAIL and ZSWAP_SWAPCACHE_EXIST case, we forgot to
call zpool_unmap_handle() when zpool can't sleep. And we might sleep in
zswap_get_swap_cache_page() while zpool can't sleep. To fix all of these,
zpool_unmap_handle() should be done before zswap_get_swap_cache_page()
when zpool can't sleep.
Link: https://lkml.kernel.org/r/20210522092242.3233191-4-linmiaohe@huawei.com Fixes: fc6697a89f56 ("mm/zswap: add the flag can_sleep_mapped") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Seth Jennings <sjenning@redhat.com> Cc: Tian Tao <tiantao6@hisilicon.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Wed, 2 Jun 2021 03:53:02 +0000 (13:53 +1000)]
mm/zswap.c: avoid unnecessary copy-in at map time
The buf mapped via zpool_map_handle() is only used to store compressed
page buffer and there is no information to extract from it. So we could
use ZPOOL_MM_WO instead to avoid unnecessary copy-in at map time.
Link: https://lkml.kernel.org/r/20210522092242.3233191-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Colin Ian King <colin.king@canonical.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Seth Jennings <sjenning@redhat.com> Cc: Tian Tao <tiantao6@hisilicon.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Miaohe Lin [Wed, 2 Jun 2021 03:53:02 +0000 (13:53 +1000)]
mm/zswap.c: remove unused function zswap_debugfs_exit()
Patch series "Cleanup and fixup for zswap".
This series contains cleanups to remove unused function and avoid
unnecessary copy-in at map time. Also this fixes two bugs in the function
zswap_writeback_entry(). More details can be found in the respective
changelogs.
This patch (of 3):
zswap_debugfs_exit() is unused, remove it.
Link: https://lkml.kernel.org/r/20210522092242.3233191-1-linmiaohe@huawei.com Link: https://lkml.kernel.org/r/20210522092242.3233191-2-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Colin Ian King <colin.king@canonical.com> Cc: Tian Tao <tiantao6@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 2 Jun 2021 03:53:01 +0000 (13:53 +1000)]
mm-rmap-make-try_to_unmap-void-function-fix-fix
fix merge snafu
Cc: Hugh Dickins <hughd@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 2 Jun 2021 03:53:01 +0000 (13:53 +1000)]
mm-rmap-make-try_to_unmap-void-function-fix
fix comment grammar
Cc: Hugh Dickins <hughd@google.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Wed, 2 Jun 2021 03:53:01 +0000 (13:53 +1000)]
mm: rmap: make try_to_unmap() void function
Currently try_to_unmap() return bool value by checking page_mapcount(),
however this may return false positive since page_mapcount() doesn't
check all subpages of compound page. The total_mapcount() could be used
instead, but its cost is higher since it traverses all subpages.
Actually the most callers of try_to_unmap() don't care about the
return value at all. So just need check if page is still mapped by
page_mapped() when necessary. And page_mapped() does bail out early
when it finds mapped subpage.
Link: https://lkml.kernel.org/r/20210526201239.3351-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Hugh Dickins <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Andrew Morton [Wed, 2 Jun 2021 03:53:01 +0000 (13:53 +1000)]
mmmemory_hotplug-drop-unneeded-locking-fix
remove now-unused locals
Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Oscar Salvador [Wed, 2 Jun 2021 03:53:01 +0000 (13:53 +1000)]
mm,memory_hotplug: drop unneeded locking
Currently, memory-hotplug code takes zone's span_writelock and pgdat's
resize_lock when resizing the node/zone's spanned pages via
{move_pfn_range_to_zone(),remove_pfn_range_from_zone()} and when resizing
node and zone's present pages via adjust_present_page_count().
These locks are also taken during the initialization of the system at boot
time, where it protects parallel struct page initialization, but they
should not really be needed in memory-hotplug where all operations are a)
synchronized on device level and b) serialized by the mem_hotplug_lock
lock.
Link: https://lkml.kernel.org/r/20210531093958.15021-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Wed, 2 Jun 2021 03:53:00 +0000 (13:53 +1000)]
memory-hotplug.rst: complete admin-guide overhaul
The memory hot(un)plug documentation is outdated and incomplete. Most of
the content dates back to 2007, so it's time for a major overhaul.
Let's rewrite, reorganize and update most parts of the documentation. In
addition to memory hot(un)plug, also add some details regarding
ZONE_MOVABLE, with memory hotunplug being one of its main consumers.
The style of the document is also properly fixed that e.g., "restview"
renders it cleanly now.
In the future, we might add some more details about virt users like
virtio-mem, the XEN balloon, the Hyper-V balloon and ppc64 dlpar.
Link: https://lkml.kernel.org/r/20210525102604.8770-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
When offlining memory the system can attempt to migrate a lot of pages, if
there are problems with migration this can flood the logs. Printing all
the data hogs the CPU and cause some RT threads to run for a long time,
which may have some bad consequences.
Rate limit the page migration warnings in order to avoid this.
Link: https://lkml.kernel.org/r/20210505140542.24935-1-georgi.djakov@linaro.org Signed-off-by: Liam Mark <lmark@codeaurora.org> Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Wed, 2 Jun 2021 03:53:00 +0000 (13:53 +1000)]
selftests/vm: add test for MADV_POPULATE_(READ|WRITE)
Let's add a simple test for MADV_POPULATE_READ and MADV_POPULATE_WRITE,
verifying some error handling, that population works, and that softdirty
tracking works as expected. For now, limit the test to private anonymous
memory.
Link: https://lkml.kernel.org/r/20210419135443.12822-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
WARNING: please, no space before tabs
#506: FILE: mm/gup.c:1513:
+^I * ^I the page dirty with FOLL_WRITE -- which doesn't make a$
WARNING: please, no space before tabs
#507: FILE: mm/gup.c:1514:
+^I * ^I difference with !FOLL_FORCE, because the page is writable$
WARNING: please, no space before tabs
#508: FILE: mm/gup.c:1515:
+^I * ^I in the page table.$
total: 0 errors, 3 warnings, 214 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
./patches/mm-madvise-introduce-madv_populate_readwrite-to-prefault-page-tables.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
David Hildenbrand [Wed, 2 Jun 2021 03:52:59 +0000 (13:52 +1000)]
mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables
I. Background: Sparse Memory Mappings
When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically populate/
discard memory inside such a sparse memory region. Example users are
hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators. In addition, we want
to fail in a nice way (instead of generating SIGBUS) if populating does
not succeed because we are out of backend memory (which can happen easily
with file-based mappings, especially tmpfs and hugetlbfs).
While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
reliably discarding memory for most mapping types, there is no generic
approach to populate page tables and preallocate memory.
Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to populate/discard dynamically
and avoid expensive/problematic remappings. In addition, we never
actually report errors during the final populate phase - it is best-effort
only.
fallocate() can be used to preallocate file-based memory and fail in a
safe way. However, it cannot really be used for any private mappings on
anonymous files via memfd due to COW semantics. In addition, fallocate()
does not actually populate page tables, so we still always get pagefaults
on first access - which is sometimes undesired (i.e., real-time workloads)
and requires real prefaulting of page tables, not just a preallocation of
backend storage. There might be interesting use cases for sparse memory
regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
as it does not prefault page tables.
II. On preallcoation/prefaulting from user space
Because we don't have a proper interface, what applications (like QEMU and
databases) end up doing is touching (i.e., reading+writing one byte to not
overwrite existing data) all individual pages.
However, that approach
1) Can result in wear on storage backing, because we end up reading/writing
each page; this is especially a problem for dax/pmem.
2) Can result in mmap_sem contention when prefaulting via multiple
threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case
of hugetlbfs/shmem/file-backed memory. For example, this is
problematic in hypervisors like QEMU where SIGBUS handlers might already
be used by other subsystems concurrently to e.g, handle hardware errors.
"Simply" doing preallocation concurrently from other thread is not that
easy.
III. On MADV_WILLNEED
Extending MADV_WILLNEED is not an option because
1. It would change the semantics: "Expect access in the near future." and
"might be a good idea to read some pages" vs. "Definitely populate/
preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
don't want populate/prealloc semantics. They treat this rather as a hint
to give a little performance boost without too much overhead - and don't
expect that a lot of memory might get consumed or a lot of time
might be spent.
IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE
Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
MAP_POPULATE, with the following semantics:
1. MADV_POPULATE_READ can be used to prefault page tables just like
manually reading each individual page. This will not break any COW
mappings. The shared zero page might get mapped and no backend storage
might get preallocated -- allocation might be deferred to
write-fault time. Especially shared file mappings require an explicit
fallocate() upfront to actually preallocate backend memory (blocks in
the file system) in case the file might have holes.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated
(prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
prefault page tables just like manually writing (or
reading+writing) each individual page. This will break any COW
mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
(prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
mappings marked with VM_PFNMAP and VM_IO. Also, proper access
permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
might have been populated.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
cannot protect from the OOM (Out Of Memory) handler killing the
process.
While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
preallocate memory and prefault page tables for VMs), one issue is that
whenever we prefault pages writable, the pages have to be marked dirty,
because the CPU could dirty them any time. while not a real problem for
hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
page will be marked dirty and has to be written back later when evicting.
MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
mapping from backend storage without marking it dirty, such that eviction
won't have to write it back. As discussed above, shared file mappings
might require an explciit fallocate() upfront to achieve
preallcoation+prepopulation.
Although sparse memory mappings are the primary use case, this will also
be useful for other preallocate/prefault use cases where MAP_POPULATE is
not desired or the semantics of MAP_POPULATE are not sufficient: as one
example, QEMU users can trigger preallocation/prefaulting of guest RAM
after the mapping was created -- and don't want errors to be silently
suppressed.
Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
however, the main motivation back than was performance improvements --
which should also still be the case.
V. Single-threaded performance comparison
I did a short experiment, prefaulting page tables on completely *empty
mappings/files* and repeated the experiment 10 times. The results
correspond to the shortest execution time. In general, the performance
benefit for huge pages is negligible with small mappings.
V.1: Private mappings
POPULATE_READ and POPULATE_WRITE is fastest. Note that
Reading/POPULATE_READ will populate the shared zeropage where applicable
-- which result in short population times.
The fastest way to allocate backend storage (here: swap or huge pages) and
prefault page tables is POPULATE_WRITE.
V.2: Shared mappings
fallocate() is fastest, however, doesn't prefault page tables.
POPULATE_WRITE is faster than simple writes and read/writes.
POPULATE_READ is faster than simple reads.
Without a fd, the fastest way to allocate backend storage and prefault
page tables is POPULATE_WRITE. With an fd, the fastest way is usually
FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
exception are actual files: FALLOCATE+Read is slightly faster than
FALLOCATE+POPULATE_READ.
The fastest way to allocate backend storage prefault page tables is
FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
dirty.
v.3: Detailed results
==================================================
2 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB : Read : 0.119 ms
Anon 4 KiB : Write : 0.222 ms
Anon 4 KiB : Read/Write : 0.380 ms
Anon 4 KiB : POPULATE_READ : 0.060 ms
Anon 4 KiB : POPULATE_WRITE : 0.158 ms
Memfd 4 KiB : Read : 0.034 ms
Memfd 4 KiB : Write : 0.310 ms
Memfd 4 KiB : Read/Write : 0.362 ms
Memfd 4 KiB : POPULATE_READ : 0.039 ms
Memfd 4 KiB : POPULATE_WRITE : 0.229 ms
Memfd 2 MiB : Read : 0.030 ms
Memfd 2 MiB : Write : 0.030 ms
Memfd 2 MiB : Read/Write : 0.030 ms
Memfd 2 MiB : POPULATE_READ : 0.030 ms
Memfd 2 MiB : POPULATE_WRITE : 0.030 ms
tmpfs : Read : 0.033 ms
tmpfs : Write : 0.313 ms
tmpfs : Read/Write : 0.406 ms
tmpfs : POPULATE_READ : 0.039 ms
tmpfs : POPULATE_WRITE : 0.285 ms
file : Read : 0.033 ms
file : Write : 0.351 ms
file : Read/Write : 0.408 ms
file : POPULATE_READ : 0.039 ms
file : POPULATE_WRITE : 0.290 ms
hugetlbfs : Read : 0.030 ms
hugetlbfs : Write : 0.030 ms
hugetlbfs : Read/Write : 0.030 ms
hugetlbfs : POPULATE_READ : 0.030 ms
hugetlbfs : POPULATE_WRITE : 0.030 ms
**************************************************
4096 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB : Read : 237.940 ms
Anon 4 KiB : Write : 708.409 ms
Anon 4 KiB : Read/Write : 1054.041 ms
Anon 4 KiB : POPULATE_READ : 124.310 ms
Anon 4 KiB : POPULATE_WRITE : 572.582 ms
Memfd 4 KiB : Read : 136.928 ms
Memfd 4 KiB : Write : 963.898 ms
Memfd 4 KiB : Read/Write : 1106.561 ms
Memfd 4 KiB : POPULATE_READ : 78.450 ms
Memfd 4 KiB : POPULATE_WRITE : 805.881 ms
Memfd 2 MiB : Read : 357.116 ms
Memfd 2 MiB : Write : 357.210 ms
Memfd 2 MiB : Read/Write : 357.606 ms
Memfd 2 MiB : POPULATE_READ : 356.094 ms
Memfd 2 MiB : POPULATE_WRITE : 356.937 ms
tmpfs : Read : 137.536 ms
tmpfs : Write : 954.362 ms
tmpfs : Read/Write : 1105.954 ms
tmpfs : POPULATE_READ : 80.289 ms
tmpfs : POPULATE_WRITE : 822.826 ms
file : Read : 137.874 ms
file : Write : 987.025 ms
file : Read/Write : 1107.439 ms
file : POPULATE_READ : 80.413 ms
file : POPULATE_WRITE : 857.622 ms
hugetlbfs : Read : 355.607 ms
hugetlbfs : Write : 355.729 ms
hugetlbfs : Read/Write : 356.127 ms
hugetlbfs : POPULATE_READ : 354.585 ms
hugetlbfs : POPULATE_WRITE : 355.138 ms
**************************************************
2 MiB MAP_SHARED:
**************************************************
Anon 4 KiB : Read : 0.394 ms
Anon 4 KiB : Write : 0.348 ms
Anon 4 KiB : Read/Write : 0.400 ms
Anon 4 KiB : POPULATE_READ : 0.326 ms
Anon 4 KiB : POPULATE_WRITE : 0.273 ms
Anon 2 MiB : Read : 0.030 ms
Anon 2 MiB : Write : 0.030 ms
Anon 2 MiB : Read/Write : 0.030 ms
Anon 2 MiB : POPULATE_READ : 0.030 ms
Anon 2 MiB : POPULATE_WRITE : 0.030 ms
Memfd 4 KiB : Read : 0.412 ms
Memfd 4 KiB : Write : 0.372 ms
Memfd 4 KiB : Read/Write : 0.419 ms
Memfd 4 KiB : POPULATE_READ : 0.343 ms
Memfd 4 KiB : POPULATE_WRITE : 0.288 ms
Memfd 4 KiB : FALLOCATE : 0.137 ms
Memfd 4 KiB : FALLOCATE+Read : 0.446 ms
Memfd 4 KiB : FALLOCATE+Write : 0.330 ms
Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms
Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms
Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms
Memfd 2 MiB : Read : 0.030 ms
Memfd 2 MiB : Write : 0.030 ms
Memfd 2 MiB : Read/Write : 0.030 ms
Memfd 2 MiB : POPULATE_READ : 0.030 ms
Memfd 2 MiB : POPULATE_WRITE : 0.030 ms
Memfd 2 MiB : FALLOCATE : 0.030 ms
Memfd 2 MiB : FALLOCATE+Read : 0.031 ms
Memfd 2 MiB : FALLOCATE+Write : 0.031 ms
Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms
Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms
Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms
tmpfs : Read : 0.416 ms
tmpfs : Write : 0.369 ms
tmpfs : Read/Write : 0.425 ms
tmpfs : POPULATE_READ : 0.346 ms
tmpfs : POPULATE_WRITE : 0.295 ms
tmpfs : FALLOCATE : 0.139 ms
tmpfs : FALLOCATE+Read : 0.447 ms
tmpfs : FALLOCATE+Write : 0.333 ms
tmpfs : FALLOCATE+Read/Write : 0.454 ms
tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms
tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms
file : Read : 0.191 ms
file : Write : 0.511 ms
file : Read/Write : 0.524 ms
file : POPULATE_READ : 0.196 ms
file : POPULATE_WRITE : 0.434 ms
file : FALLOCATE : 0.004 ms
file : FALLOCATE+Read : 0.197 ms
file : FALLOCATE+Write : 0.554 ms
file : FALLOCATE+Read/Write : 0.480 ms
file : FALLOCATE+POPULATE_READ : 0.201 ms
file : FALLOCATE+POPULATE_WRITE : 0.381 ms
hugetlbfs : Read : 0.030 ms
hugetlbfs : Write : 0.030 ms
hugetlbfs : Read/Write : 0.030 ms
hugetlbfs : POPULATE_READ : 0.030 ms
hugetlbfs : POPULATE_WRITE : 0.030 ms
hugetlbfs : FALLOCATE : 0.030 ms
hugetlbfs : FALLOCATE+Read : 0.031 ms
hugetlbfs : FALLOCATE+Write : 0.031 ms
hugetlbfs : FALLOCATE+Read/Write : 0.030 ms
hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms
hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms
**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anon 4 KiB : Read : 1053.090 ms
Anon 4 KiB : Write : 913.642 ms
Anon 4 KiB : Read/Write : 1060.350 ms
Anon 4 KiB : POPULATE_READ : 893.691 ms
Anon 4 KiB : POPULATE_WRITE : 782.885 ms
Anon 2 MiB : Read : 358.553 ms
Anon 2 MiB : Write : 358.419 ms
Anon 2 MiB : Read/Write : 357.992 ms
Anon 2 MiB : POPULATE_READ : 357.533 ms
Anon 2 MiB : POPULATE_WRITE : 357.808 ms
Memfd 4 KiB : Read : 1078.144 ms
Memfd 4 KiB : Write : 942.036 ms
Memfd 4 KiB : Read/Write : 1100.391 ms
Memfd 4 KiB : POPULATE_READ : 925.829 ms
Memfd 4 KiB : POPULATE_WRITE : 804.394 ms
Memfd 4 KiB : FALLOCATE : 304.632 ms
Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms
Memfd 4 KiB : FALLOCATE+Write : 933.186 ms
Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms
Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms
Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms
Memfd 2 MiB : Read : 358.131 ms
Memfd 2 MiB : Write : 358.099 ms
Memfd 2 MiB : Read/Write : 358.250 ms
Memfd 2 MiB : POPULATE_READ : 357.563 ms
Memfd 2 MiB : POPULATE_WRITE : 357.334 ms
Memfd 2 MiB : FALLOCATE : 356.735 ms
Memfd 2 MiB : FALLOCATE+Read : 358.152 ms
Memfd 2 MiB : FALLOCATE+Write : 358.331 ms
Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms
Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms
Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms
tmpfs : Read : 1087.265 ms
tmpfs : Write : 950.840 ms
tmpfs : Read/Write : 1107.567 ms
tmpfs : POPULATE_READ : 922.605 ms
tmpfs : POPULATE_WRITE : 810.094 ms
tmpfs : FALLOCATE : 306.320 ms
tmpfs : FALLOCATE+Read : 1169.796 ms
tmpfs : FALLOCATE+Write : 933.730 ms
tmpfs : FALLOCATE+Read/Write : 1191.610 ms
tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms
tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms
file : Read : 654.101 ms
file : Write : 1259.142 ms
file : Read/Write : 1289.509 ms
file : POPULATE_READ : 661.642 ms
file : POPULATE_WRITE : 1106.816 ms
file : FALLOCATE : 1.864 ms
file : FALLOCATE+Read : 656.328 ms
file : FALLOCATE+Write : 1153.300 ms
file : FALLOCATE+Read/Write : 1180.613 ms
file : FALLOCATE+POPULATE_READ : 668.347 ms
file : FALLOCATE+POPULATE_WRITE : 996.143 ms
hugetlbfs : Read : 357.245 ms
hugetlbfs : Write : 357.413 ms
hugetlbfs : Read/Write : 357.120 ms
hugetlbfs : POPULATE_READ : 356.321 ms
hugetlbfs : POPULATE_WRITE : 356.693 ms
hugetlbfs : FALLOCATE : 355.927 ms
hugetlbfs : FALLOCATE+Read : 357.074 ms
hugetlbfs : FALLOCATE+Write : 357.120 ms
hugetlbfs : FALLOCATE+Read/Write : 356.983 ms
hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms
hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms
**************************************************
[1] https://lkml.org/lkml/2013/6/27/698
Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Kefeng Wang [Wed, 2 Jun 2021 03:52:59 +0000 (13:52 +1000)]
mm: generalize ZONE_[DMA|DMA32]
ZONE_[DMA|DMA32] configs have duplicate definitions on platforms that
subscribe to them. Instead, just make them generic options which can be
selected on applicable platforms.
Also only x86/arm64 architectures could enable both ZONE_DMA and
ZONE_DMA32 if EXPERT, add ARCH_HAS_ZONE_DMA_SET to make dma zone
configurable and visible on the two architectures.
Link: https://lkml.kernel.org/r/20210528074557.17768-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k] Acked-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> [RISC-V] Acked-by: Michal Simek <michal.simek@xilinx.com> [microblaze] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Richard Henderson <rth@twiddle.net> Cc: Russell King <linux@armlinux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anshuman Khandual [Wed, 2 Jun 2021 03:52:58 +0000 (13:52 +1000)]
mm/thp: make ARCH_ENABLE_SPLIT_PMD_PTLOCK dependent on PGTABLE_LEVELS > 2
ARCH_ENABLE_SPLIT_PMD_PTLOCK is irrelevant unless there are more than two
page table levels including PMD (also per
Documentation/vm/split_page_table_lock.rst). Make this dependency
explicit on remaining platforms i.e x86 and s390 where
ARCH_ENABLE_SPLIT_PMD_PTLOCK is subscribed.
Yang Shi [Wed, 2 Jun 2021 03:52:58 +0000 (13:52 +1000)]
mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
fail, but the first VM_BUG_ON_PAGE() just checks page_mapcount() however
it may miss the failure when head page is unmapped but other subpage is
mapped. Then the second DEBUG_VM BUG() that check total mapcount would
catch it. This may incur some confusion. And this is not a fatal issue,
so consolidate the two DEBUG_VM checks into one VM_WARN_ON_ONCE_PAGE().
Link: https://lkml.kernel.org/r/20210526201239.3351-1-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Anshuman Khandual [Wed, 2 Jun 2021 03:52:58 +0000 (13:52 +1000)]
mm/thp: make ALLOC_SPLIT_PTLOCKS dependent on USE_SPLIT_PTE_PTLOCKS
Split ptlocks need not be defined and allocated unless they are being
used. ALLOC_SPLIT_PTLOCKS is inherently dependent on
USE_SPLIT_PTE_PTLOCKS. This just makes it explicit and clear. While here
drop the spinlock_t element from the struct page when
USE_SPLIT_PTE_PTLOCKS is not enabled.
Link: https://lkml.kernel.org/r/1621409586-5555-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Wed, 2 Jun 2021 03:52:58 +0000 (13:52 +1000)]
mm: thp: skip make PMD PROT_NONE if THP migration is not supported
A quick grep shows x86_64, PowerPC (book3s), ARM64 and S390 support both
NUMA balancing and THP. But S390 doesn't support THP migration so NUMA
balancing actually can't migrate any misplaced pages.
Skip make PMD PROT_NONE for such case otherwise CPU cycles may be wasted
by pointless NUMA hinting faults on S390.
Link: https://lkml.kernel.org/r/20210518200801.7413-8-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Wed, 2 Jun 2021 03:52:57 +0000 (13:52 +1000)]
mm: migrate: check mapcount for THP instead of refcount
The generic migration path will check refcount, so no need check refcount
here. But the old code actually prevents from migrating shared THP
(mapped by multiple processes), so bail out early if mapcount is > 1 to
keep the behavior.
Link: https://lkml.kernel.org/r/20210518200801.7413-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Yang Shi [Wed, 2 Jun 2021 03:52:57 +0000 (13:52 +1000)]
mm: migrate: don't split THP for misplaced NUMA page
The old behavior didn't split THP if migration is failed due to lack of
memory on the target node. But the THP migration does split THP, so keep
the old behavior for misplaced NUMA page migration.
Link: https://lkml.kernel.org/r/20210518200801.7413-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>