Petr Mladek [Wed, 11 Nov 2020 13:54:50 +0000 (14:54 +0100)]
printk/console: Allow to disable console output by using console="" or console=null
The commit 48021f98130880dd74 ("printk: handle blank console arguments
passed in.") prevented crash caused by empty console= parameter value.
Unfortunately, this value is widely used on Chromebooks to disable
the console output. The above commit caused performance regression
because the messages were pushed on slow console even though nobody
was watching it.
Use ttynull driver explicitly for console="" and console=null
parameters. It has been created for exactly this purpose.
It causes that preferred_console is set. As a result, ttySX and ttyX
are not used as a fallback. And only ttynull console gets registered by
default.
It still allows to register other consoles either by additional console=
parameters or SPCR. It prevents regression because it worked this way even
before. Also it is a sane semantic. Preventing output on all consoles
should be done another way, for example, by introducing mute_console
parameter.
Petr Mladek [Wed, 11 Nov 2020 13:54:49 +0000 (14:54 +0100)]
init/console: Use ttynull as a fallback when there is no console
stdin, stdout, and stderr standard I/O stream are created for the init
process. They are not available when there is no console registered
for /dev/console. It might lead to a crash when the init process
tries to use them, see the commit 48021f98130880dd742 ("printk: handle
blank console arguments passed in.").
Normally, ttySX and ttyX consoles are used as a fallback when no consoles
are defined via the command line, device tree, or SPCR. But there
will be no console registered when an invalid console name is configured
or when the configured consoles do not exist on the system.
Users even try to avoid the console intentionally, for example,
by using console="" or console=null. It is used on production
systems where the serial port or terminal are not visible to
users. Pushing messages to these consoles would just unnecessary
slowdown the system.
Make sure that stdin, stdout, stderr, and /dev/console are always
available by a fallback to the existing ttynull driver. It has
been implemented for exactly this purpose but it was used only
when explicitly configured.
Linus Torvalds [Fri, 16 Oct 2020 19:52:37 +0000 (12:52 -0700)]
Merge tag 'printk-for-5.10-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:
"Prevent overflow in the new lockless ringbuffer"
* tag 'printk-for-5.10-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: ringbuffer: Wrong data pointer when appending small string
Linus Torvalds [Fri, 16 Oct 2020 19:47:18 +0000 (12:47 -0700)]
Merge tag 'kgdb-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"A fairly modest set of changes for this cycle.
Of particular note are an earlycon fix from Doug Anderson and my own
changes to get kgdb/kdb to honour the kprobe blocklist. The later
creates a safety rail that strongly encourages developers not to place
breakpoints in, for example, arch specific trap handling code.
Also included are a couple of small fixes and tweaks: an API update,
eliminate a coverity dead code warning, improved handling of search
during multi-line printk and a couple of typo corrections"
* tag 'kgdb-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Fix pager search for multi-line strings
kernel: debug: Centralize dbg_[de]activate_sw_breakpoints
kgdb: Add NOKPROBE labels on the trap handler functions
kgdb: Honour the kprobe blocklist when setting breakpoints
kernel/debug: Fix spelling mistake in debug_core.c
kdb: Use newer api for tasklist scanning
kgdb: Make "kgdbcon" work properly with "kgdb_earlycon"
kdb: remove unnecessary null check of dbg_io_ops
Linus Torvalds [Fri, 16 Oct 2020 19:40:55 +0000 (12:40 -0700)]
Merge tag 'mips_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS updates from Thomas Bogendoerfer:
- removed support for PNX833x alias NXT_STB22x
- included Ingenic SoC support into generic MIPS kernels
- added support for new Ingenic SoCs
- converted workaround selection to use Kconfig
- replaced old boot mem functions by memblock_*
- enabled COP2 usage in kernel for Loongson64 to make use
of 16byte load/stores possible
- cleanups and fixes
* tag 'mips_5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: (92 commits)
MIPS: DEC: Restore bootmem reservation for firmware working memory area
MIPS: dec: fix section mismatch
bcm963xx_tag.h: fix duplicated word
mips: ralink: enable zboot support
MIPS: ingenic: Remove CPU_SUPPORTS_HUGEPAGES
MIPS: cpu-probe: remove MIPS_CPU_BP_GHIST option bit
MIPS: cpu-probe: introduce exclusive R3k CPU probe
MIPS: cpu-probe: move fpu probing/handling into its own file
MIPS: replace add_memory_region with memblock
MIPS: Loongson64: Clean up numa.c
MIPS: Loongson64: Select SMP in Kconfig to avoid build error
mips: octeon: Add Ubiquiti E200 and E220 boards
MIPS: SGI-IP28: disable use of ll/sc in kernel
MIPS: tx49xx: move tx4939_add_memory_regions into only user
MIPS: pgtable: Remove used PAGE_USERIO define
MIPS: alchemy: Share prom_init implementation
MIPS: alchemy: Fix build breakage, if TOUCHSCREEN_WM97XX is disabled
MIPS: process: include exec.h header in process.c
MIPS: process: Add prototype for function arch_dup_task_struct
MIPS: idle: Add prototype for function check_wait
...
Linus Torvalds [Fri, 16 Oct 2020 19:36:38 +0000 (12:36 -0700)]
Merge tag 's390-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Vasily Gorbik:
- Remove address space overrides using set_fs()
- Convert to generic vDSO
- Convert to generic page table dumper
- Add ARCH_HAS_DEBUG_WX support
- Add leap seconds handling support
- Add NVMe firmware-assisted kernel dump support
- Extend NVMe boot support with memory clearing control and addition of
kernel parameters
- AP bus and zcrypt api code rework. Add adapter configure/deconfigure
interface. Extend debug features. Add failure injection support
- Add ECC secure private keys support
- Add KASan support for running protected virtualization host with
4-level paging
- Utilize destroy page ultravisor call to speed up secure guests
shutdown
- Implement ioremap_wc() and ioremap_prot() with MIO in PCI code
- Various checksum improvements
- Other small various fixes and improvements all over the code
* tag 's390-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (85 commits)
s390/uaccess: fix indentation
s390/uaccess: add default cases for __put_user_fn()/__get_user_fn()
s390/zcrypt: fix wrong format specifications
s390/kprobes: move insn_page to text segment
s390/sie: fix typo in SIGP code description
s390/lib: fix kernel doc for memcmp()
s390/zcrypt: Introduce Failure Injection feature
s390/zcrypt: move ap_msg param one level up the call chain
s390/ap/zcrypt: revisit ap and zcrypt error handling
s390/ap: Support AP card SCLP config and deconfig operations
s390/sclp: Add support for SCLP AP adapter config/deconfig
s390/ap: add card/queue deconfig state
s390/ap: add error response code field for ap queue devices
s390/ap: split ap queue state machine state from device state
s390/zcrypt: New config switch CONFIG_ZCRYPT_DEBUG
s390/zcrypt: introduce msg tracking in zcrypt functions
s390/startup: correct early pgm check info formatting
s390: remove orphaned extern variables declarations
s390/kasan: make sure int handler always run with DAT on
s390/ipl: add support to control memory clearing for nvme re-IPL
...
Linus Torvalds [Fri, 16 Oct 2020 19:21:15 +0000 (12:21 -0700)]
Merge tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- A series from Nick adding ARCH_WANT_IRQS_OFF_ACTIVATE_MM & selecting
it for powerpc, as well as a related fix for sparc.
- Remove support for PowerPC 601.
- Some fixes for watchpoints & addition of a new ptrace flag for
detecting ISA v3.1 (Power10) watchpoint features.
- A fix for kernels using 4K pages and the hash MMU on bare metal
Power9 systems with > 16TB of RAM, or RAM on the 2nd node.
- A basic idle driver for shallow stop states on Power10.
- Tweaks to our sched domains code to better inform the scheduler about
the hardware topology on Power9/10, where two SMT4 cores can be
presented by firmware as an SMT8 core.
- A series doing further reworks & cleanups of our EEH code.
- Addition of a filter for RTAS (firmware) calls done via sys_rtas(),
to prevent root from overwriting kernel memory.
- Other smaller features, fixes & cleanups.
Thanks to: Alexey Kardashevskiy, Andrew Donnellan, Aneesh Kumar K.V,
Athira Rajeev, Biwen Li, Cameron Berkenpas, Cédric Le Goater, Christophe
Leroy, Christoph Hellwig, Colin Ian King, Daniel Axtens, David Dai, Finn
Thain, Frederic Barrat, Gautham R. Shenoy, Greg Kurz, Gustavo Romero,
Ira Weiny, Jason Yan, Joel Stanley, Jordan Niethe, Kajol Jain, Konrad
Rzeszutek Wilk, Laurent Dufour, Leonardo Bras, Liu Shixin, Luca
Ceresoli, Madhavan Srinivasan, Mahesh Salgaonkar, Nathan Lynch, Nicholas
Mc Guire, Nicholas Piggin, Nick Desaulniers, Oliver O'Halloran, Pedro
Miraglia Franco de Carvalho, Pratik Rajesh Sampat, Qian Cai, Qinglang
Miao, Ravi Bangoria, Russell Currey, Satheesh Rajendran, Scott Cheloha,
Segher Boessenkool, Srikar Dronamraju, Stan Johnson, Stephen Kitt,
Stephen Rothwell, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain,
Vaidyanathan Srinivasan, Vasant Hegde, Wang Wensheng, Wolfram Sang, Yang
Yingliang, zhengbin.
* tag 'powerpc-5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (228 commits)
Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"
selftests/powerpc: Fix eeh-basic.sh exit codes
cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_reboot_notifier
powerpc/time: Make get_tb() common to PPC32 and PPC64
powerpc/time: Make get_tbl() common to PPC32 and PPC64
powerpc/time: Remove get_tbu()
powerpc/time: Avoid using get_tbl() and get_tbu() internally
powerpc/time: Make mftb() common to PPC32 and PPC64
powerpc/time: Rename mftbl() to mftb()
powerpc/32s: Remove #ifdef CONFIG_PPC_BOOK3S_32 in head_book3s_32.S
powerpc/32s: Rename head_32.S to head_book3s_32.S
powerpc/32s: Setup the early hash table at all time.
powerpc/time: Remove ifdef in get_dec() and set_dec()
powerpc: Remove get_tb_or_rtc()
powerpc: Remove __USE_RTC()
powerpc: Tidy up a bit after removal of PowerPC 601.
powerpc: Remove support for PowerPC 601
powerpc: Remove PowerPC 601
powerpc: Drop SYNC_601() ISYNC_601() and SYNC()
powerpc: Remove CONFIG_PPC601_SYNC_FIX
...
Linus Torvalds [Fri, 16 Oct 2020 18:31:55 +0000 (11:31 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"155 patches.
Subsystems affected by this patch series: mm (dax, debug, thp,
readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
romfs, and fault-injection"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
lib, uaccess: add failure injection to usercopy functions
lib, include/linux: add usercopy failure capability
ROMFS: support inode blocks calculation
ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
sched.h: drop in_ubsan field when UBSAN is in trap mode
scripts/gdb/tasks: add headers and improve spacing format
scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
kernel/relay.c: drop unneeded initialization
panic: dump registers on panic_on_warn
rapidio: fix the missed put_device() for rio_mport_add_riodev
rapidio: fix error handling path
nilfs2: fix some kernel-doc warnings for nilfs2
autofs: harden ioctl table
ramfs: fix nommu mmap with gaps in the page cache
mm: remove the now-unnecessary mmget_still_valid() hack
mm/gup: take mmap_lock in get_dump_page()
binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
coredump: rework elf/elf_fdpic vma_dump_size() into common helper
coredump: refactor page range dumping into common helper
coredump: let dump_emit() bail out on short writes
...
Patch series "add fault injection to user memory access", v3.
The goal of this series is to improve testing of fault-tolerance in usages
of user memory access functions, by adding support for fault injection.
syzkaller/syzbot are using the existing fault injection modes and will use
this particular feature also.
The first patch adds failure injection capability for usercopy functions.
The second changes usercopy functions to use this new failure capability
(copy_from_user, ...). The third patch adds get/put/clear_user failures
to x86.
This patch (of 3):
Add a failure injection capability to improve testing of fault-tolerance
in usages of user memory access functions.
Add CONFIG_FAULT_INJECTION_USERCOPY to enable faults in usercopy
functions. The should_fail_usercopy function is to be called by these
functions (copy_from_user, get_user, ...) in order to fail or not.
Signed-off-by: Albert van der Linde <alinde@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20200831171733.955393-1-alinde@google.com Link: http://lkml.kernel.org/r/20200831171733.955393-2-alinde@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Libing Zhou [Fri, 16 Oct 2020 03:13:42 +0000 (20:13 -0700)]
ROMFS: support inode blocks calculation
When use 'stat' tool to display file status, the 'Blocks' field always in
'0', this is not good for tool 'du'(e.g.: busybox 'du'), it always output
'0' size for the files under ROMFS since such tool calculates number of
512B Blocks.
This patch calculates approx. number of 512B blocks based on inode size.
Signed-off-by: Libing Zhou <libing.zhou@nokia-sbell.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/20200811052606.4243-1-libing.zhou@nokia-sbell.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
George Popescu [Fri, 16 Oct 2020 03:13:38 +0000 (20:13 -0700)]
ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
When the kernel is compiled with Clang, -fsanitize=bounds expands to
-fsanitize=array-bounds and -fsanitize=local-bounds.
Enabling -fsanitize=local-bounds with Clang has the unfortunate
side-effect of inserting traps; this goes back to its original intent,
which was as a hardening and not a debugging feature [1]. The same
feature made its way into -fsanitize=bounds, but the traps remained. For
that reason, -fsanitize=bounds was split into 'array-bounds' and
'local-bounds' [2].
Since 'local-bounds' doesn't behave like a normal sanitizer, enable it
with Clang only if trapping behaviour was requested by
CONFIG_UBSAN_TRAP=y.
Add the UBSAN_BOUNDS_LOCAL config to Kconfig.ubsan to enable the
'local-bounds' option by default when UBSAN_TRAP is enabled.
Elena Petrova [Fri, 16 Oct 2020 03:13:35 +0000 (20:13 -0700)]
sched.h: drop in_ubsan field when UBSAN is in trap mode
in_ubsan field of task_struct is only used in lib/ubsan.c, which in its
turn is used only `ifneq ($(CONFIG_UBSAN_TRAP),y)`.
Removing unnecessary field from a task_struct will help preserve the ABI
between vanilla and CONFIG_UBSAN_TRAP'ed kernels. In particular, this
will help enabling bounds sanitizer transparently for Android's GKI.
Sudip Mukherjee [Fri, 16 Oct 2020 03:13:25 +0000 (20:13 -0700)]
kernel/relay.c: drop unneeded initialization
The variable 'consumed' is initialized with the consumed count but
immediately after that the consumed count is updated and assigned to
'consumed' again thus overwriting the previous value. So, drop the
unneeded initialization.
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20201005205727.1147-1-sudipm.mukherjee@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Kardashevskiy [Fri, 16 Oct 2020 03:13:22 +0000 (20:13 -0700)]
panic: dump registers on panic_on_warn
Currently we print stack and registers for ordinary warnings but we do not
for panic_on_warn which looks as oversight - panic() will reboot the
machine but won't print registers.
This moves printing of registers and modules earlier.
This does not move the stack dumping as panic() dumps it.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Douglas Anderson <dianders@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Rafael Aquini <aquini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Link: https://lkml.kernel.org/r/20200804095054.68724-1-aik@ozlabs.ru Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Souptick Joarder [Fri, 16 Oct 2020 03:13:15 +0000 (20:13 -0700)]
rapidio: fix error handling path
rio_dma_transfer() attempts to clamp the return value of
pin_user_pages_fast() to be >= 0. However, the attempt fails because
nr_pages is overridden a few lines later, and restored to the undesirable
-ERRNO value.
The return value is ultimately stored in nr_pages, which in turn is passed
to unpin_user_pages(), which expects nr_pages >= 0, else, disaster.
Fix this by fixing the nesting of the assignment to nr_pages: nr_pages
should be clamped to zero if pin_user_pages_fast() returns -ERRNO, or set
to the return value of pin_user_pages_fast(), otherwise.
[jhubbard@nvidia.com: new changelog]
Fixes: e8de370188d09 ("rapidio: add mport char device driver") Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lkml.kernel.org/r/1600227737-20785-1-git-send-email-jrdr.linux@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wang Hai [Fri, 16 Oct 2020 03:13:11 +0000 (20:13 -0700)]
nilfs2: fix some kernel-doc warnings for nilfs2
Fixes the following W=1 kernel build warning(s):
fs/nilfs2/bmap.c:378: warning: Excess function parameter 'bhp' description in 'nilfs_bmap_assign'
fs/nilfs2/cpfile.c:907: warning: Excess function parameter 'status' description in 'nilfs_cpfile_change_cpmode'
fs/nilfs2/cpfile.c:946: warning: Excess function parameter 'stat' description in 'nilfs_cpfile_get_stat'
fs/nilfs2/page.c:76: warning: Excess function parameter 'inode' description in 'nilfs_forget_buffer'
fs/nilfs2/sufile.c:563: warning: Excess function parameter 'stat' description in 'nilfs_sufile_get_stat'
Matthew Wilcox [Fri, 16 Oct 2020 03:13:08 +0000 (20:13 -0700)]
autofs: harden ioctl table
The table of ioctl functions should be marked const in order to put them
in read-only memory, and we should use array_index_nospec() to avoid
speculation disclosing the contents of kernel memory to userspace.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Ian Kent <raven@themaw.net> Link: https://lkml.kernel.org/r/20200818122203.GO17456@casper.infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox (Oracle) [Fri, 16 Oct 2020 03:13:04 +0000 (20:13 -0700)]
ramfs: fix nommu mmap with gaps in the page cache
ramfs needs to check that pages are both physically contiguous and
contiguous in the file. If the page cache happens to have, eg, page A for
index 0 of the file, no page for index 1, and page A+1 for index 2, then
an mmap of the first two pages of the file will succeed when it should
fail.
Fixes: 642fb4d1f1dd ("[PATCH] NOMMU: Provide shared-writable mmap support on ramfs") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Howells <dhowells@redhat.com> Link: https://lkml.kernel.org/r/20200914122239.GO6583@casper.infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jann Horn [Fri, 16 Oct 2020 03:13:00 +0000 (20:13 -0700)]
mm: remove the now-unnecessary mmget_still_valid() hack
The preceding patches have ensured that core dumping properly takes the
mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
its users.
Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jann Horn [Fri, 16 Oct 2020 03:12:57 +0000 (20:12 -0700)]
mm/gup: take mmap_lock in get_dump_page()
Properly take the mmap_lock before calling into the GUP code from
get_dump_page(); and play nice, allowing the GUP code to drop the
mmap_lock if it has to sleep.
As Linus pointed out, we don't actually need the VMA because
__get_user_pages() will flush the dcache for us if necessary.
Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200827114932.3572699-7-jannh@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jann Horn [Fri, 16 Oct 2020 03:12:54 +0000 (20:12 -0700)]
binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
In both binfmt_elf and binfmt_elf_fdpic, use a new helper
dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
VMA, if we have one) while protected by the mmap_lock, and then use that
snapshot instead of walking the VMA list without locking.
An alternative approach would be to keep the mmap_lock held across the
entire core dumping operation; however, keeping the mmap_lock locked while
we may be blocked for an unbounded amount of time (e.g. because we're
dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
blocks things like the ->release handler of userfaultfd, and we don't
really want critical system daemons to grind to a halt just because
someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
something like that.
Since both the normal ELF code and the FDPIC ELF code need this
functionality (and if any other binfmt wants to add coredump support in
the future, they'd probably need it, too), implement this with a common
helper in fs/coredump.c.
A downside of this approach is that we now need a bigger amount of kernel
memory per userspace VMA in the normal ELF case, and that we need O(n)
kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
be terribly bad.
There currently is a data race between stack expansion and anything that
reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
mitigate that for core dumping, take the mmap_lock in write mode when
taking a snapshot of the VMA hierarchy. (If we only took the mmap_lock in
read mode, we could end up with a corrupted core dump if someone does
get_user_pages_remote() concurrently. Not really a major problem, but
taking the mmap_lock either way works here, so we might as well avoid the
issue.) (This doesn't do anything about the existing data races with stack
expansion in other mm code.)
Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jann Horn [Fri, 16 Oct 2020 03:12:50 +0000 (20:12 -0700)]
coredump: rework elf/elf_fdpic vma_dump_size() into common helper
At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
different code to figure out which VMAs should be dumped, and if so,
whether the dump should contain the entire VMA or just its first page.
Eliminate duplicate code by reworking the binfmt_elf version into a
generic core dumping helper in coredump.c.
As part of that, change the heuristic for detecting executable/library
header pages to check whether the inode is executable instead of looking
at the file mode.
This is less problematic in terms of locking because it lets us avoid
get_user() under the mmap_sem. (And arguably it looks nicer and makes
more sense in generic code.)
Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
has been written to.
Jann Horn [Fri, 16 Oct 2020 03:12:43 +0000 (20:12 -0700)]
coredump: let dump_emit() bail out on short writes
dump_emit() has a retry loop, but there seems to be no way for that retry
logic to actually be used; and it was also buggy, writing the same data
repeatedly after a short write.
Jann Horn [Fri, 16 Oct 2020 03:12:40 +0000 (20:12 -0700)]
binfmt_elf_fdpic: stop using dump_emit() on user pointers on !MMU
Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.
At the moment, we have that rather ugly mmget_still_valid() helper to work
around <https://crbug.com/project-zero/1790>: ELF core dumping doesn't
take the mmap_sem while traversing the task's VMAs, and if anything (like
userfaultfd) then remotely messes with the VMA tree, fireworks ensue. So
at the moment we use mmget_still_valid() to bail out in any writers that
might be operating on a remote mm's VMAs.
With this series, I'm trying to get rid of the need for that as cleanly as
possible. ("cleanly" meaning "avoid holding the mmap_lock across
unbounded sleeps".)
Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
dumping code.
Patches 5 and 6 implement the main change: Instead of repeatedly accessing
the VMA list with sleeps in between, we snapshot it at the start with
proper locking, and then later we just use our copy of the VMA list. This
ensures that the kernel won't crash, that VMA metadata in the coredump is
consistent even in the presence of concurrent modifications, and that any
virtual addresses that aren't being concurrently modified have their
contents show up in the core dump properly.
The disadvantage of this approach is that we need a bit more memory during
core dumping for storing metadata about all VMAs.
At the end of the series, patch 7 removes the old workaround for this
issue (mmget_still_valid()).
I have tested:
- Creating a simple core dump on X86-64 still works.
- The created coredump on X86-64 opens in GDB and looks plausible.
- X86-64 core dumps contain the first page for executable mappings at
offset 0, and don't contain the first page for non-executable file
mappings or executable mappings at offset !=0.
- NOMMU 32-bit ARM can still generate plausible-looking core dumps
through the FDPIC implementation. (I can't test this with GDB because
GDB is missing some structure definition for nommu ARM, but I've
poked around in the hexdump and it looked decent.)
This patch (of 7):
dump_emit() is for kernel pointers, and VMAs describe userspace memory.
Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
even if it probably doesn't matter much on !MMU systems - especially given
that it looks like we can just use the same get_dump_page() as on MMU if
we move it out of the CONFIG_MMU block.
One small change we have to make in get_dump_page() is to use
__get_user_pages_locked() instead of __get_user_pages(), since the latter
doesn't exist on nommu. On mmu builds, __get_user_pages_locked() will
just call __get_user_pages() for us.
Chris Kennelly [Fri, 16 Oct 2020 03:12:32 +0000 (20:12 -0700)]
fs/binfmt_elf: use PT_LOAD p_align values for suitable start address
Patch series "Selecting Load Addresses According to p_align", v3.
The current ELF loading mechancism provides page-aligned mappings. This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.
While specifying -z,max-page-size=0x200000 to the linker will generate
suitably aligned segments for huge pages on x86_64, the executable needs
to be loaded at a suitably aligned address as well. This alignment
requires the binary's cooperation, as distinct segments need to be
appropriately paddded to be eligible for THP.
For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.
This patch (of 2):
The current ELF loading mechancism provides page-aligned mappings. This
can lead to the program being loaded in a way unsuitable for file-backed,
transparent huge pages when handling PIE executables.
For binaries built with increased alignment, this limits the number of
bits usable for ASLR, but provides some randomization over using fixed
load addresses/non-PIE binaries.
Tested by verifying program with -Wl,-z,max-page-size=0x200000 loading.
[akpm@linux-foundation.org: fix max() warning]
[ckennelly@google.com: augment comment] Link: https://lkml.kernel.org/r/20200821233848.3904680-2-ckennelly@google.com Signed-off-by: Chris Kennelly <ckennelly@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Song Liu <songliubraving@fb.com> Cc: David Rientjes <rientjes@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Hugh Dickens <hughd@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Fangrui Song <maskray@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Link: https://lkml.kernel.org/r/20200820170541.1132271-1-ckennelly@google.com Link: https://lkml.kernel.org/r/20200820170541.1132271-2-ckennelly@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dwaipayan Ray [Fri, 16 Oct 2020 03:12:28 +0000 (20:12 -0700)]
checkpatch: add new warnings to author signoff checks.
The author signed-off-by checks are currently very vague. Cases like same
name or same address are not handled separately.
For example, running checkpatch on commit be6577af0cef ("parisc: Add
atomic64_set_release() define to avoid CPU soft lockups"), gives:
WARNING: Missing Signed-off-by: line by nominal patch author
'John David Anglin <dave.anglin@bell.net>'
The signoff line was:
"Signed-off-by: Dave Anglin <dave.anglin@bell.net>"
Clearly the author has signed off but with a slightly different version
of his name. A more appropriate warning would have been to point out
at the name mismatch instead.
Previously, the values assumed by $authorsignoff were either 0 or 1
to indicate whether a proper sign off by author is present.
Extended the checks to handle four new cases.
$authorsignoff values now denote the following:
0: Missing sign off by patch author.
1: Sign off present and identical.
2: Addresses and names match, but comments differ.
"James Watson(JW) <james@gmail.com>", "James Watson <james@gmail.com>"
3: Addresses match, but names are different.
"James Watson <james@gmail.com>", "James <james@gmail.com>"
4: Names match, but addresses are different.
"James Watson <james@watson.com>", "James Watson <james@gmail.com>"
Łukasz Stelmach [Fri, 16 Oct 2020 03:12:25 +0000 (20:12 -0700)]
checkpatch: fix false positive on empty block comment lines
To avoid false positives in presence of SPDX-License-Identifier in
networking files it is required to increase the leeway for empty block
comment lines by one line.
For example, checking drivers/net/loopback.c which starts with
// SPDX-License-Identifier: GPL-2.0-or-later
/*
* INET An implementation of the TCP/IP protocol suite for the LINUX
rsults in an unnecessary warning
WARNING: networking block comments don't use an empty /* line, use /* Comment...
+/*
+ * INET An implementation of the TCP/IP protocol suite for the LINUX
Signed-off-by: Łukasz Stelmach <l.stelmach@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Joe Perches <joe@perches.com> Cc: Bartłomiej Żolnierkiewicz <b.zolnierkie@samsung.co> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lkml.kernel.org/r/20201006083509.19934-1-l.stelmach@samsung.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dwaipayan Ray [Fri, 16 Oct 2020 03:12:22 +0000 (20:12 -0700)]
checkpatch: fix multi-statement macro checks for while blocks.
Checkpatch.pl doesn't have a check for excluding while (...) {...} blocks
from MULTISTATEMENT_MACRO_USE_DO_WHILE error.
For example, running checkpatch.pl on the file mm/maccess.c in the kernel
generates the following error:
ERROR: Macros with complex values should be enclosed in parentheses
+#define copy_from_kernel_nofault_loop(dst, src, len, type, err_label) \
+ while (len >= sizeof(type)) { \
+ __get_kernel_nofault(dst, src, type, err_label); \
+ dst += sizeof(type); \
+ src += sizeof(type); \
+ len -= sizeof(type); \
+ }
The error is misleading for this case. Enclosing it in parentheses
doesn't make any sense.
Checkpatch already has an exception list for such common macro types.
Added a new exception for while (...) {...} style blocks to the same.
In addition, the brace flatten logic was modified by changing the
substitution characters from "1" to "1u". This was done to ensure that
macros in the form "#define foo(bar) while(bar){bar--;}" were also
correctly procecssed.
Dwaipayan Ray [Fri, 16 Oct 2020 03:12:15 +0000 (20:12 -0700)]
checkpatch: extend author Signed-off-by check for split From: header
Checkpatch did not handle cases where the author From: header was split
into multiple lines. The author identity could not be resolved and
checkpatch generated a false NO_AUTHOR_SIGN_OFF warning.
A typical example is commit e33bcbab16d1 ("tee: add support for session's
client UUID generation"). When checkpatch was run on this commit, it
displayed:
"WARNING:NO_AUTHOR_SIGN_OFF: Missing Signed-off-by: line by nominal
patch author ''"
This was due to split header lines not being handled properly and the
author himself wrote in commit cd2614967d8b ("checkpatch: warn if missing
author Signed-off-by"):
"Split From: headers are not fully handled: only the first part
is compared."
Support split From: headers by correctly parsing the header extension
lines. RFC 5322, Section-2.2.3 stated that each extended line must start
with a WSP character (a space or htab). The solution was therefore to
concatenate the lines which start with a WSP to get the correct long
header.
Joe Perches [Fri, 16 Oct 2020 03:12:12 +0000 (20:12 -0700)]
checkpatch: allow not using -f with files that are in git
If a file exists in git and checkpatch is used without the -f flag for
scanning a file, then checkpatch will scan the file assuming it's a patch
and emit:
ERROR: Does not appear to be a unified-diff format patch
Change the behavior to assume the -f flag if the file exists in git.
Joe Perches [Fri, 16 Oct 2020 03:12:09 +0000 (20:12 -0700)]
checkpatch: warn on self-assignments
The uninitialized_var() macro was removed recently via commit 63a0895d960a
("compiler: Remove uninitialized_var() macro") as it's not a particularly
useful warning and its use can "paper over real bugs".
Add a checkpatch test to warn on self-assignments as a means to avoid
compiler warnings and as a back-door mechanism to reproduce the old
uninitialized_var macro behavior.
[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Denis Efremov <efremov@linux.com> Cc: Julia Lawall <julia.lawall@inria.fr> Link: https://lkml.kernel.org/r/afc2cffdd315d3e4394af149278df9e8af7f49f4.camel@perches.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rikard Falkeborn [Fri, 16 Oct 2020 03:12:05 +0000 (20:12 -0700)]
const_structs.checkpatch: add pinctrl_ops and pinmux_ops
All usages of include/linux of these are const pointers, and all instances
in the kernel except one, that are not const can be made const (patches
have been posted for those separately).
Nicolas Boichat [Fri, 16 Oct 2020 03:12:02 +0000 (20:12 -0700)]
checkpatch: warn if trace_printk and friends are called
trace_printk is meant as a debugging tool, and should not be compiled into
production code without specific debug Kconfig options enabled, or source
code changes, as indicated by the warning that shows up on boot if any
trace_printk is called:
** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
** **
** trace_printk() being used. Allocating extra memory. **
** **
** This means that this is a DEBUG kernel and it is **
** unsafe for production use. **
Let's warn developers when they try to submit such a change.
Rikard Falkeborn [Fri, 16 Oct 2020 03:11:59 +0000 (20:11 -0700)]
const_structs.checkpatch: add phy_ops
All usages of phy_ops in include/linux uses const phy_ops * and all
instances of phy_ops in the kernel that are not const already can be made
const (patches have been posted for those separately).
Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Kishon Vijay Abraham I <kishon@ti.com> Cc: Vinod Koul <vkoul@kernel.org> Link: https://lkml.kernel.org/r/20200824214132.9072-1-rikard.falkeborn@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Fri, 16 Oct 2020 03:11:56 +0000 (20:11 -0700)]
checkpatch: add test for comma use that should be semicolon
There are commas used as statement terminations that should typically have
used semicolons instead. Only direct assignments or use of a single
function or value on a single line are detected by this test.
e.g.:
foo = bar(), /* typical use is semicolon not comma */
bar = baz();
Add an imperfect test to detect these comma uses.
No false positives were found in testing, but many types of false
negatives are possible.
e.g.:
foo = bar() + 1, /* comma use, but not direct assignment */
bar = baz();
Jerome Forissier [Fri, 16 Oct 2020 03:11:49 +0000 (20:11 -0700)]
checkpatch: add --kconfig-prefix
Kconfig allows to customize the CONFIG_ prefix via the $CONFIG_
environment variable. Out-of-tree projects may therefore use Kconfig with
a different prefix, or they may use a custom configuration tool which does
not use the CONFIG_ prefix at all. Such projects may still want to adhere
to the Linux kernel coding style and run checkpatch.pl.
One example is OP-TEE [1] which does not use Kconfig but does have
configuration options prefixed with CFG_. It also mostly follows the
kernel coding style and therefore being able to use checkpatch is quite
valuable.
To make this possible, add the --kconfig-prefix command line option.
[1] https://github.com/OP-TEE/optee_os
Signed-off-by: Jerome Forissier <jerome@forissier.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200818081732.800449-1-jerome@forissier.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tobias Jordan [Fri, 16 Oct 2020 03:11:38 +0000 (20:11 -0700)]
lib/crc32.c: fix trivial typo in preprocessor condition
Whether crc32_be needs a lookup table is chosen based on CRC_LE_BITS.
Obviously, the _be function should be governed by the _BE_ define.
This probably never pops up as it's hard to come up with a configuration
where CRC_BE_BITS isn't the same as CRC_LE_BITS and as nobody is using
bitwise CRC anyway.
Fixes: 46c5801eaf86 ("crc32: bolt on crc32c") Signed-off-by: Tobias Jordan <kernel@cdqe.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lkml.kernel.org/r/20200923182122.GA3338@agrajag.zerfleddert.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Carpenter [Fri, 16 Oct 2020 03:11:34 +0000 (20:11 -0700)]
lib/test_hmm.c: fix an error code in dmirror_allocate_chunk()
This is supposed to return false on failure, not a negative error code.
Fixes: 170e38548b81 ("mm/hmm/test: use after free in dmirror_allocate_chunk()") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dan Williams <dan.j.williams@intel.com> Link: https://lkml.kernel.org/r/20201010200812.GA1886610@mwanda Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephen Boyd [Fri, 16 Oct 2020 03:11:21 +0000 (20:11 -0700)]
lib/idr.c: document that ida_simple_{get,remove}() are deprecated
These two functions are deprecated. Users should call ida_alloc() or
ida_free() respectively instead. Add documentation to this effect until
the macro can be removed.
Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Tri Vo <trong@android.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Link: https://lkml.kernel.org/r/20200910055246.2297797-2-swboyd@chromium.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephen Boyd [Fri, 16 Oct 2020 03:11:17 +0000 (20:11 -0700)]
lib/idr.c: document calling context for IDA APIs mustn't use locks
The documentation for these functions indicates that callers don't need to
hold a lock while calling them, but that documentation is only in one
place under "IDA Usage". Let's state the same information on each IDA
function so that it's clear what the calling context requires.
Furthermore, let's document ida_simple_get() with the same information so
that callers know how this API works.
Signed-off-by: Stephen Boyd <swboyd@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Tri Vo <trong@android.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Link: https://lkml.kernel.org/r/20200910055246.2297797-1-swboyd@chromium.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Fri, 16 Oct 2020 03:10:31 +0000 (20:10 -0700)]
kernel: acct.c: fix some kernel-doc nits
Fix kernel-doc notation to use the documented Returns: syntax and place
the function description for acct_process() on the first line where it
should be.
Randy Dunlap [Fri, 16 Oct 2020 03:10:28 +0000 (20:10 -0700)]
kernel/: fix repeated words in comments
Fix multiple occurrences of duplicated words in kernel/.
Fix one typo/spello on the same line as a duplicate word. Change one
instance of "the the" to "that the". Otherwise just drop one of the
repeated words.
Liao Pingfang [Fri, 16 Oct 2020 03:10:25 +0000 (20:10 -0700)]
kernel/sys.c: replace do_brk with do_brk_flags in comment of prctl_set_mm_map()
Replace do_brk with do_brk_flags in comment of prctl_set_mm_map(), since
do_brk was removed in following commit.
Fixes: bb177a732c4369 ("mm: do not bug_on on incorrect length in __mm_populate()") Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn> Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/1600650751-43127-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Shevchenko [Fri, 16 Oct 2020 03:10:21 +0000 (20:10 -0700)]
kernel.h: split out min()/max() et al. helpers
kernel.h is being used as a dump for all kinds of stuff for a long time.
Here is the attempt to start cleaning it up by splitting out min()/max()
et al. helpers.
At the same time convert users in header and lib folder to use new header.
Though for time being include new header back to kernel.h to avoid
twisted indirected includes for other existing users.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joe Perches <joe@perches.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20200910164152.GA1891694@smile.fi.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Wilcox (Oracle) [Fri, 16 Oct 2020 03:10:15 +0000 (20:10 -0700)]
mm: rename page_order() to buddy_order()
The current page_order() can only be called on pages in the buddy
allocator. For compound pages, you have to use compound_order(). This is
confusing and led to a bug, so rename page_order() to buddy_order().
Miaohe Lin [Fri, 16 Oct 2020 03:10:08 +0000 (20:10 -0700)]
mm: use helper function put_write_access()
In commit 1da177e4c3f4 ("Linux-2.6.12-rc2"), the helper put_write_access()
came with the atomic_dec operation of the i_writecount field. But it
forgot to use this helper in __vma_link_file() and dup_mmap().
Xiaofei Tan [Fri, 16 Oct 2020 03:10:05 +0000 (20:10 -0700)]
mm/workingset.c: fix some doc warnings
Fix following warnings caused by mismatch bewteen function parameters and
comments.
mm/workingset.c:228: warning: Function parameter or member 'lruvec' not described in 'workingset_age_nonresident'
mm/workingset.c:228: warning: Excess function parameter 'memcg' description in 'workingset_age_nonresident'
Wei Yang [Fri, 16 Oct 2020 03:09:49 +0000 (20:09 -0700)]
mm/page_reporting.c: drop stale list head check in page_reporting_cycle
list_for_each_entry_safe() guarantees that we will never stumble over the
list head; "&page->lru != list" will always evaluate to true. Let's
simplify.
[david@redhat.com: Changelog refinements]
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Link: http://lkml.kernel.org/r/20200818084448.33969-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Douglas Anderson [Fri, 16 Oct 2020 03:09:43 +0000 (20:09 -0700)]
zram: failing to decompress is WARN_ON worthy
If we fail to decompress in zram it's a pretty serious problem. We were
entrusted to be able to decompress the old data but we failed. Either
we've got some crazy bug in the compression code or we've got memory
corruption.
At the moment, when this happens the log looks like this:
It is true that we have an "ALERT" level log in there, but (at least to
me) it feels like even this isn't enough to impart the seriousness of this
error. Let's convert to a WARN_ON. Note that WARN_ON is automatically
"unlikely" so we can simply replace the old annotation with the new one.
Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Sonny Rao <sonnyrao@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lkml.kernel.org/r/20200917174059.1.If09c882545dbe432268f7a67a4d4cfcb6caace4f@changeid Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:09:35 +0000 (20:09 -0700)]
mm/page_alloc: place pages to tail in __free_pages_core()
__free_pages_core() is used when exposing fresh memory to the buddy during
system boot and when onlining memory in generic_online_page().
generic_online_page() is used in two cases:
1. Direct memory onlining in online_pages().
2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
balloon and virtio-mem), when parts of a section are kept
fake-offline to be fake-onlined later on.
In 1, we already place pages to the tail of the freelist. Pages will be
freed to MIGRATE_ISOLATE lists first and moved to the tail of the
freelists via undo_isolate_page_range().
In 2, we currently don't implement a proper rule. In case of virtio-mem,
where we currently always online MAX_ORDER - 1 pages, the pages will be
placed to the HEAD of the freelist - undesireable. While the hyper-v
balloon calls generic_online_page() with single pages, usually it will
call it on successive single pages in a larger block.
The pages are fresh, so place them to the tail of the freelist and avoid
the PCP. In __free_pages_core(), remove the now superflouos call to
set_page_refcounted() and add a comment regarding page initialization and
the refcount.
Note: In 2. we currently don't shuffle. If ever relevant (page shuffling
is usually of limited use in virtualized environments), we might want to
shuffle after a sequence of generic_online_page() calls in the relevant
callers.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:09:30 +0000 (20:09 -0700)]
mm/page_alloc: move pages to tail in move_to_free_list()
Whenever we move pages between freelists via move_to_free_list()/
move_freepages_block(), we don't actually touch the pages:
1. Page isolation doesn't actually touch the pages, it simply isolates
pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
When undoing isolation, we move the pages back to the target list.
2. Page stealing (steal_suitable_fallback()) moves free pages directly
between lists without touching them.
3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
free pages directly between freelists without touching them.
We already place pages to the tail of the freelists when undoing isolation
via __putback_isolated_page(), let's do it in any case (e.g., if order <=
pageblock_order) and document the behavior. To simplify, let's move the
pages to the tail for all move_to_free_list()/move_freepages_block() users.
In 2., the target list is empty, so there should be no change. In 3., we
might observe a change, however, highatomic is more concerned about
allocations succeeding than cache hotness - if we ever realize this change
degrades a workload, we can special-case this instance and add a proper
comment.
This change results in all pages getting onlined via online_pages() to be
placed to the tail of the freelist.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:09:26 +0000 (20:09 -0700)]
mm/page_alloc: place pages to tail in __putback_isolated_page()
__putback_isolated_page() already documents that pages will be placed to
the tail of the freelist - this is, however, not the case for "order >=
MAX_ORDER - 2" (see buddy_merge_likely()) - which should be the case for
all existing users.
This change affects two users:
- free page reporting
- page isolation, when undoing the isolation (including memory onlining).
This behavior is desirable for pages that haven't really been touched
lately, so exactly the two users that don't actually read/write page
content, but rather move untouched pages.
The new behavior is especially desirable for memory onlining, where we
allow allocation of newly onlined pages via undo_isolate_page_range() in
online_pages(). Right now, we always place them to the head of the
freelist, resulting in undesireable behavior: Assume we add individual
memory chunks via add_memory() and online them right away to the NORMAL
zone. We create a dependency chain of unmovable allocations e.g., via the
memmap. The memmap of the next chunk will be placed onto previous chunks
- if the last block cannot get offlined+removed, all dependent ones cannot
get offlined+removed. While this can already be observed with individual
DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc
DLPAR).
Document that this should only be used for optimizations, and no code
should rely on this behavior for correction (if the order of the freelists
ever changes).
We won't care about page shuffling: memory onlining already properly
shuffles after onlining. free page reporting doesn't care about
physically contiguous ranges, and there are already cases where page
isolation will simply move (physically close) free pages to (currently)
the head of the freelists via move_freepages_block() instead of shuffling.
If this becomes ever relevant, we should shuffle the whole zone when
undoing isolation of larger ranges, and after free_contig_range().
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:09:20 +0000 (20:09 -0700)]
mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag
Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.
When adding separate memory blocks via add_memory*() and onlining them
immediately, the metadata (especially the memmap) of the next block will
be placed onto one of the just added+onlined block. This creates a chain
of unmovable allocations: If the last memory block cannot get
offlined+removed() so will all dependent ones. We directly have unmovable
allocations all over the place.
This can be observed quite easily using virtio-mem, however, it can also
be observed when using DIMMs. The freshly onlined pages will usually be
placed to the head of the freelists, meaning they will be allocated next,
turning the just-added memory usually immediately un-removable. The fresh
pages are cold, prefering to allocate others (that might be hot) also
feels to be the natural thing to do.
It also applies to the hyper-v balloon xen-balloon, and ppc64 dlpar: when
adding separate, successive memory blocks, each memory block will have
unmovable allocations on them - for example gigantic pages will fail to
allocate.
While the ZONE_NORMAL doesn't provide any guarantees that memory can get
offlined+removed again (any kind of fragmentation with unmovable
allocations is possible), there are many scenarios (hotplugging a lot of
memory, running workload, hotunplug some memory/as much as possible) where
we can offline+remove quite a lot with this patchset.
a) To visualize the problem, a very simple example:
Start a VM with 4GB and 8GB of virtio-mem memory:
[root@localhost ~]# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000033fffffff 9G online yes 32-103
Memory block size: 128M
Total online memory: 12G
Total offline memory: 0B
Then try to unplug as much as possible using virtio-mem. Observe which
memory blocks are still around. Without this patch set:
Memory block size: 128M
Total online memory: 8.1G
Total offline memory: 0B
With this patch set:
[root@localhost ~]# lsmem
RANGE SIZE STATE REMOVABLE BLOCK
0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
0x0000000100000000-0x000000013fffffff 1G online yes 32-39
Memory block size: 128M
Total online memory: 4G
Total offline memory: 0B
All memory can get unplugged, all memory block can get removed. Of
course, no workload ran and the system was basically idle, but it
highlights the issue - the fairly deterministic chain of unmovable
allocations. When a huge page for the 2MB memmap is needed, a
just-onlined 4MB page will be split. The remaining 2MB page will be used
for the memmap of the next memory block. So one memory block will hold
the memmap of the two following memory blocks. Finally the pages of the
last-onlined memory block will get used for the next bigger allocations -
if any allocation is unmovable, all dependent memory blocks cannot get
unplugged and removed until that allocation is gone.
Note that with bigger memory blocks (e.g., 256MB), *all* memory
blocks are dependent and none can get unplugged again!
b) Experiment with memory intensive workload
I performed an experiment with an older version of this patch set (before
we used undo_isolate_page_range() in online_pages(): Hotplug 56GB to a VM
with an initial 4GB, onlining all memory to ZONE_NORMAL right from the
kernel when adding it. I then run various memory intensive workloads that
consume most system memory for a total of 45 minutes. Once finished, I
try to unplug as much memory as possible.
With this change, I am able to remove via virtio-mem (adding individual
128MB memory blocks) 413 out of 448 added memory blocks. Via individual
(256MB) DIMMs 380 out of 448 added memory blocks. (I don't have any
numbers without this patchset, but looking at the above example, it's at
most half of the 448 memory blocks for virtio-mem, and most probably none
for DIMMs).
Again, there are workloads that might behave very differently due to the
nature of ZONE_NORMAL.
This change also affects (besides memory onlining):
- Other users of undo_isolate_page_range(): Pages are always placed to the
tail.
-- When memory offlining fails
-- When memory isolation fails after having isolated some pageblocks
-- When alloc_contig_range() either succeeds or fails
- Other users of __putback_isolated_page(): Pages are always placed to the
tail.
-- Free page reporting
- Other users of __free_pages_core()
-- AFAIKs, any memory that is getting exposed to the buddy during boot.
IIUC we will now usually allocate memory from lower addresses within
a zone first (especially during boot).
- Other users of generic_online_page()
-- Hyper-V balloon
This patch (of 5):
Let's prepare for additional flags and avoid long parameter lists of
bools. Follow-up patches will also make use of the flags in
__free_pages_ok().
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Laurent Dufour [Fri, 16 Oct 2020 03:09:15 +0000 (20:09 -0700)]
mm: don't panic when links can't be created in sysfs
At boot time, or when doing memory hot-add operations, if the links in
sysfs can't be created, the system is still able to run, so just report
the error in the kernel log rather than BUG_ON and potentially make system
unusable because the callpath can be called with locks held.
Since the number of memory blocks managed could be high, the messages are
rate limited.
As a consequence, link_mem_sections() has no status to report anymore.
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200915094143.79181-4-ldufour@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:09:12 +0000 (20:09 -0700)]
kernel/resource: make iomem_resource implicit in release_mem_region_adjustable()
"mem" in the name already indicates the root, similar to
release_mem_region() and devm_request_mem_region(). Make it implicit.
The only single caller always passes iomem_resource, other parents are not
applicable.
Suggested-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Link: https://lkml.kernel.org/r/20200916073041.10355-1-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:09:07 +0000 (20:09 -0700)]
hv_balloon: try to merge system ram resources
Let's try to merge system ram resources we add, to minimize the number of
resources in /proc/iomem. We don't care about the boundaries of
individual chunks we added.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Liu <wei.liu@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Grall <julien@xen.org> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <lenb@kernel.org> Cc: Leonardo Bras <leobras.c@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Link: https://lkml.kernel.org/r/20200911103459.10306-9-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:09:01 +0000 (20:09 -0700)]
xen/balloon: try to merge system ram resources
Let's try to merge system ram resources we add, to minimize the number of
resources in /proc/iomem. We don't care about the boundaries of
individual chunks we added.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Julien Grall <julien@xen.org> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <lenb@kernel.org> Cc: Leonardo Bras <leobras.c@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20200911103459.10306-8-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:08:56 +0000 (20:08 -0700)]
virtio-mem: try to merge system ram resources
virtio-mem adds memory in memory block granularity, to be able to remove
it in the same granularity again later, and to grow slowly on demand.
This, however, results in quite a lot of resources when adding a lot of
memory. Resources are effectively stored in a list-based tree. Having a
lot of resources not only wastes memory, it also makes traversing that
tree more expensive, and makes /proc/iomem explode in size (e.g.,
requiring kexec-tools to manually merge resources later when e.g., trying
to create a kdump header).
Of course, with more hotplugged memory, it gets worse. When unplugging
memory blocks again, try_remove_memory() (via offline_and_remove_memory())
will properly split the resource up again.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Grall <julien@xen.org> Cc: Kees Cook <keescook@chromium.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <lenb@kernel.org> Cc: Leonardo Bras <leobras.c@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20200911103459.10306-7-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:08:49 +0000 (20:08 -0700)]
mm/memory_hotplug: MEMHP_MERGE_RESOURCE to specify merging of System RAM resources
Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (e.g., it might be relevant for
DIMMs, exposed via /proc/iomem to user space). We really want to merge
added resources in this scenario where possible.
Let's provide a flag (MEMHP_MERGE_RESOURCE) to specify that a resource
either created within add_memory*() or passed via add_memory_resource()
shall be marked mergeable and merged with applicable siblings.
To implement that, we need a kernel/resource interface to mark selected
System RAM resources mergeable (IORESOURCE_SYSRAM_MERGEABLE) and trigger
merging.
Note: We really want to merge after the whole operation succeeded, not
directly when adding a resource to the resource tree (it would break
add_memory_resource() and require splitting resources again when the
operation failed - e.g., due to -ENOMEM).
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Julien Grall <julien@xen.org> Cc: Baoquan He <bhe@redhat.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Leonardo Bras <leobras.c@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Link: https://lkml.kernel.org/r/20200911103459.10306-6-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:08:39 +0000 (20:08 -0700)]
mm/memory_hotplug: guard more declarations by CONFIG_MEMORY_HOTPLUG
We soon want to pass flags via a new type to add_memory() and friends.
That revealed that we currently don't guard some declarations by
CONFIG_MEMORY_HOTPLUG.
While some definitions could be moved to different places, let's keep it
minimal for now and use CONFIG_MEMORY_HOTPLUG for all functions only
compiled with CONFIG_MEMORY_HOTPLUG.
Wrap sparse_decode_mem_map() into CONFIG_MEMORY_HOTPLUG, it's only called
from CONFIG_MEMORY_HOTPLUG code.
While at it, remove allow_online_pfn_range(), which is no longer around,
and mhp_notimplemented(), which is unused.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Grall <julien@xen.org> Cc: Kees Cook <keescook@chromium.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <lenb@kernel.org> Cc: Leonardo Bras <leobras.c@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20200911103459.10306-4-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:08:33 +0000 (20:08 -0700)]
kernel/resource: move and rename IORESOURCE_MEM_DRIVER_MANAGED
IORESOURCE_MEM_DRIVER_MANAGED currently uses an unused PnP bit, which is
always set to 0 by hardware. This is far from beautiful (and confusing),
and the bit only applies to SYSRAM. So let's move it out of the
bus-specific (PnP) defined bits.
We'll add another SYSRAM specific bit soon. If we ever need more bits for
other purposes, we can steal some from "desc", or reshuffle/regroup what
we have.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Grall <julien@xen.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <lenb@kernel.org> Cc: Leonardo Bras <leobras.c@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20200911103459.10306-3-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:08:28 +0000 (20:08 -0700)]
kernel/resource: make release_mem_region_adjustable() never fail
Patch series "selective merging of system ram resources", v4.
Some add_memory*() users add memory in small, contiguous memory blocks.
Examples include virtio-mem, hyper-v balloon, and the XEN balloon.
This can quickly result in a lot of memory resources, whereby the actual
resource boundaries are not of interest (e.g., it might be relevant for
DIMMs, exposed via /proc/iomem to user space). We really want to merge
added resources in this scenario where possible.
Resources are effectively stored in a list-based tree. Having a lot of
resources not only wastes memory, it also makes traversing that tree more
expensive, and makes /proc/iomem explode in size (e.g., requiring
kexec-tools to manually merge resources when creating a kdump header. The
current kexec-tools resource count limit does not allow for more than
~100GB of memory with a memory block size of 128MB on x86-64).
Let's allow to selectively merge system ram resources by specifying a new
flag for add_memory*(). Patch #5 contains a /proc/iomem example. Only
tested with virtio-mem.
This patch (of 8):
Let's make sure splitting a resource on memory hotunplug will never fail.
This will become more relevant once we merge selected System RAM resources
- then, we'll trigger that case more often on memory hotunplug.
In general, this function is already unlikely to fail. When we remove
memory, we free up quite a lot of metadata (memmap, page tables, memory
block device, etc.). The only reason it could really fail would be when
injecting allocation errors.
All other error cases inside release_mem_region_adjustable() seem to be
sanity checks if the function would be abused in different context - let's
add WARN_ON_ONCE() in these cases so we can catch them.
[natechancellor@gmail.com: fix use of ternary condition in release_mem_region_adjustable] Link: https://lkml.kernel.org/r/20200922060748.2452056-1-natechancellor@gmail.com Link: https://github.com/ClangBuiltLinux/linux/issues/1159 Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kees Cook <keescook@chromium.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Grall <julien@xen.org> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <lenb@kernel.org> Cc: Leonardo Bras <leobras.c@gmail.com> Cc: Libor Pechacek <lpechacek@suse.cz> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: "Oliver O'Halloran" <oohall@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Roger Pau Monn <roger.pau@citrix.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Liu <wei.liu@kernel.org> Link: https://lkml.kernel.org/r/20200911103459.10306-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:08:23 +0000 (20:08 -0700)]
mm/memory_hotplug: mark pageblocks MIGRATE_ISOLATE while onlining memory
Currently, it can happen that pages are allocated (and freed) via the
buddy before we finished basic memory onlining.
For example, pages are exposed to the buddy and can be allocated before we
actually mark the sections online. Allocated pages could suddenly fail
pfn_to_online_page() checks. We had similar issues with pcp handling,
when pages are allocated+freed before we reach zone_pcp_update() in
online_pages() [1].
Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
impossible. Once done with the heavy lifting, use
undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
freelist, marking them ready for allocation. Similar to offline_pages(),
we have to manually adjust zone->nr_isolate_pageblock.
David Hildenbrand [Fri, 16 Oct 2020 03:08:19 +0000 (20:08 -0700)]
mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone()
On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
un-isolate the pages after memory onlining is complete. Let's allow
passing in the migratetype.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Charan Teja Reddy <charante@codeaurora.org> Cc: Mel Gorman <mgorman@techsingularity.net> Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:08:15 +0000 (20:08 -0700)]
mm/page_alloc: drop stale pageblock comment in memmap_init_zone*()
Commit ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are
set during struct page init") moved the actual zone range check, leaving
only the alignment check for pageblocks.
Let's drop the stale comment and make the pageblock check easier to read.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Charan Teja Reddy <charante@codeaurora.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200819175957.28465-9-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Hildenbrand [Fri, 16 Oct 2020 03:08:11 +0000 (20:08 -0700)]
mm/memory_hotplug: simplify page onlining
We don't allow to offline memory with holes, all boot memory is online,
and all hotplugged memory cannot have holes.
We can now simplify onlining of pages. As we only allow to online/offline
full sections and sections always span full MAX_ORDER_NR_PAGES, we can
just process MAX_ORDER - 1 pages without further special handling.
The number of onlined pages simply corresponds to the number of pages we
were requested to online.
While at it, refine the comment regarding the callback not exposing all
pages to the buddy.
Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Charan Teja Reddy <charante@codeaurora.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michel Lespinasse <walken@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200819175957.28465-8-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>